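小狐狸事務所: July 2024

Wednesday, July 31, 2024

Attending Google Cloud OnBoard

This afternoon I attended the Google Cloud OnBoard online course:

Home - Google Cloud OnBoard: AI and Machine Learning (cloudonair.withgoogle.com)

The course is an introduction to Google's ML/AI offerings and has no hands-on part. It mainly explains the full range of services Google provides for machine learning and AI: besides training customized models of your own, Google also offers pretrained large models and no-code/low-code tools. For more information see the QR code:

The page below demonstrates the analysis features of Google's natural language processing: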

# https://cloud.google.com/natural-language?hl=zh_tw#demo

# Google Cloud OnBoard: AI and Machine Learning | Module 2: AI Development Options

Good book: Build a Large Language Model (From Scratch)

A while ago I saw on 陳南宗's Facebook page that he recommended Sebastian Raschka's major work on LLMs:

# Build a Large Language Model (From Scratch) US$23.99

Source : Manning

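It is currently a MEAP early-access e-book, and US$24 is not expensive, but even if I bought it I would have no time to read it now (I am still digesting the Python web-scraping books I bought in 2018), so I am noting it here and will wait for the final edition.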

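Monday, July 29, 2024

A full drive C may be caused by Windows update files

Two weeks ago I noticed that the old Lemel PC at my country home showed less than 3 GB free on drive C, so I could not even download a moderately large file. Strange, since it normally had more than 40 GB free!

I had assumed the cause was OneDrive, which keeps a OneDrive Personal folder on drive C synchronized with the cloud, so I thought of moving that folder to drive D. Actually moving it turned out to be harder than expected. Searching around, I found the article below, which introduces the AOMEI Partition Assistant software:

# Free Up Drive C: 2 Quick Ways to Move OneDrive to Drive D!

It is not expensive: a lifetime license for two machines costs less than NT$2,000. I first downloaded the trial version from the official site:

It lets you choose which applications to move to drive D:

But only at the last step did it reveal that the trial version cannot use the App Mover. What a pity: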

# AOMEI Partition Assistant Professional

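Although the price is acceptable, I held back. I then tried another tool, 4DDiG, but unfortunately it can only migrate the entire drive C to another disk and has no feature like the App Mover:

# Drive C Running Out of Space? 3 Ways to Clean Up Drive C!

Later I noticed a Windows Update notice in the lower-right corner, and it occurred to me that I have long prevented Microsoft from updating this PC automatically: I set it to notify me so that I can decide whether to update (the recent worldwide Windows outage shows why). I began to suspect that the system had already downloaded the update files automatically and only then asked me whether to install them.

Just to try it, I restarted the computer. After the reboot File Explorer showed drive C back at 41 GB:

So when drive C runs out of space, the cause may well be a pile of downloaded Windows update files. The problem that had bothered me for two weeks was finally solved!

References:

# Windows: drive C running out of space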

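Sunday, July 28, 2024

Starting this Wednesday we had three consecutive typhoon days off because of Typhoon Gaemi. The unexpectedly strong wind and rain flooded many areas of central and southern Taiwan and caused heavy losses for farmers and fishermen. Fortunately my homes are on high ground and none was flooded, but when I went back to the countryside at the weekend I found all five papaya trees in the vegetable garden blown down. Seeing papaya trees that were in flower and fruit lying every which way made my heart bleed; I cannot imagine the pain of the fish farmers whose ponds were washed out and submerged. Early on Saturday my uncle (小舅) came to the garden, propped the fallen papaya trees back up and braced them with wooden slats, but by noon the leaves were drooping, so whether they will recover is still uncertain:

This one on the field ridge used to be perfectly straight, and the wind bent it like this, which shows how strong the wind was:

Yesterday's news reported a fish farmer who had installed a 9KW energy storage system. After three straight days of wind and rain the stored power ran out, and with the grid line also blown down the aerator pumps could not run, so the fish died and all the effort was lost. This was probably carelessness: forgetting that green energy sources are inherently unstable and not adding a fuel generator. That is why I bought a small four-stroke gasoline generator last year: in typhoons of the past few years, toppled utility poles caused outages lasting up to a day, and cooking and bathing by candlelight in the evening was inconvenient. Modern people cannot live without electricity, and the phone is the first thing that will not tolerate an outage.

I spent all of Saturday afternoon cleaning up. First came the fallen leaves covering the second-floor terrace in front of the ancestral shrine, mostly blown off the jackfruit, wax apple and longan trees beside the house. They can block the drain, letting water pool on the terrace and seep into the ancestral hall. The way to reduce fallen leaves is to prune the branches close to the house. After cleaning up I also harvested all the longans, because another typhoon may form next week. After three days of typhoon, quite a few longans had split. This year the longan tree flowered and fruited only on the south side near the ancestral hall and nowhere else; I really cannot figure out why.

Just as I finished, a Chunghwa Telecom engineer arrived to repair the phone line. It had failed on Wednesday, the day of the typhoon: whenever I called home I heard the busy tone, but Dad was sure every extension was on the hook, so I assumed the outside line had been damaged by the typhoon. After a long check, though, this young man, whose family home is in 廣興, said the problem was probably in the indoor extension wiring. He cut the line leading to the study and fitted a new junction box, and that fixed it. Shortly afterward 婷婷 brought her kids over to play, and when they left I gave her half of the longans to take home.

On Sunday afternoon I continued pruning, cutting off the forked branches of the indigenous cinnamon and wax apple trees along the road so that they would not rub against the power lines and cause a short circuit. What a tiring, busy weekend! Next week I plan to prune the mango trees.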

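Saturday, July 27, 2024

Finally finished "Three Kingdoms"

During the three typhoon days off I finally finished "Three Kingdoms" (三國), the 95-episode series made in China in 2010. I had not known about it; we watched it together when the kids (小狐狸們) came back to the countryside for Mother's Day in May. 二哥 said it was good, so I searched YouTube and found that it is a huge drama running to 95 episodes: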

# 除國賊曹公獻刀｜【三國】第1集

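So for two months dinner came with Three Kingdoms, one to three episodes a day. Frankly, Chinese historical costume dramas are quite well researched, and the battle scenes are very realistic. The story begins with Dong Zhuo's Xiliang army entering Luoyang and seizing control of the court after the Yellow Turban Rebellion at the end of the Eastern Han, and runs until Sima Yan, grandson of Sima Yi, usurps the Wei throne after Sima Yi's death. The seven captures of Meng Huo and the conquests of Shu and Wu are passed over in a single stroke, perhaps because more episodes would have blown the budget.

The casting is generally good, but the actor playing Sima Yi seems too old: Sima Yi was 24 years younger than Cao Cao, yet on his first appearance he looks older than Cao Cao. Still, I think his acting is excellent (especially that strange, creepy laugh, which may have been the director's main reason for casting him).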

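Monday, July 22, 2024

Notes for week 29 of 2024

This week my right foot hurt when walking, so I skipped the brisk walks on the riverbank. On Monday I went to the 蕭志文 clinic to see the rehabilitation doctor. As last time, it was a strained tendon in the right ankle. He prescribed a tube of "易妥痠痛凝膠" (a pain relief gel) for the affected area, and indeed after two applications the pain was gone. To recover properly I am not walking for now and only do Baduanjin and four minutes of Tabata aerobics during the morning and afternoon breaks.

This week was again Scrapy overtime: I finished four test notes in a row, covering page parsing with XPath and CSS selectors, Item Pipelines and so on. The finish line is finally near, and things should be wrapped up by the end of the month. Learning is always slow, but slowness has its benefit: it lets the material sink in instead of being swallowed whole.

Recently I cut back on fruit to lose weight, leaving only apples, avocados and guavas. I think guava is a great fruit: low in sugar, very high in vitamin C, and available all year round. So I have decided to take good care of the four guava trees in the garden from now on, since even my uncle praises their quality. On Sunday afternoon I picked six and bagged the newly set fruit; it had been so long that I had forgotten how to do it:

Next week when I go back I will buy a big pack of bags at the farm supply store and bag fruit every Sunday afternoon. In the summer heat the fruit grows fast, and anything left unbagged is quickly overrun by fruit flies. I do not spray any pesticide in my garden (actually I just do not want to pay for it).

小白, the cat in the countryside, has been away for more than two weeks. Last time it was gone for more than two months and I thought it had disappeared for good, until one day it suddenly showed up and ate a big bowl of cat food. With a GPS tracker I would know where and how far it goes.

Since early May, when 二哥 came back for Mother's Day and we started watching "Three Kingdoms" together, I have watched it at dinner every day and have reached episode 79. Guan Yu carelessly loses Jingzhou, retreats to Maicheng and is beheaded by Lü Meng, but the Marquis of Wu turns pale at the sight and hurriedly sends the head to Xuchang as a birthday gift for Cao Cao. Guan Yu was simply too arrogant and ignored Ma Liang's advice, which is why he died at the hands of "阿蒙 of Wu". Episode 79 shows Emperor Xian of Han, on the way to his fief at Zhuolu (濁鹿, in present-day Jiaozuo, Henan), drilling a hole in his own boat and dying with Empress Cao in the Zhang River. The official histories say otherwise: Cao Cao died in Jian'an 25 (220 AD), Cao Pi usurped the throne at the end of the same year and made Emperor Xian the Duke of Shanyang, and the former emperor lived in his fief until 234 AD, enjoying a full 14 years of carefree retirement. By then even Cao Pi was dead and Emperor Ming of Wei, Cao Rui, sat on the throne.

If I have time next week I will start pruning the mango trees.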

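Friday, July 19, 2024

Python Study Notes: Using the Scrapy Web Crawling Framework (5)

In the previous test we used Scrapy's Item and Field classes to define a structured item class for storing the crawler's target data. That only wraps the target data in an object so that it becomes structured. If the contents of some fields need further processing, such as data-cleaning jobs like removing non-English characters, converting amounts or rounding to a given number of digits, or if the data must be stored in a database, these item-related jobs are not handled in the spider but handed over to an Item Pipeline. This post tests Item Pipelines for data cleaning and for saving to a database.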

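Earlier notes in this series:

# Python Study Notes: Browser Automation Testing with the Selenium Module (1)

# Python Study Notes: Browser Automation Testing with the Selenium Module (2)

# Python Study Notes: Web Scraping (1): Using urllib and HTMLParser

# Python Study Notes: Web Scraping (2): Downloading Pages with the requests Package

# Python Study Notes: Web Scraping (3): Parsing Pages with BeautifulSoup

# Python Study Notes: Web Scraping (4): Steps and Tools for Developing a Web Crawler

# Python Study Notes: Web Scraping (5): Installing the Chrome Extension Quick Javascript Switcher

# Python Study Notes: Web Crawler in Practice (1): Bank of Taiwan Posted Exchange Rates

# Python Study Notes: Web Crawler in Practice (2): BBC Business and Financial News

# Python Study Notes: Web Crawler in Practice (3): TWSE Market Holidays

# Python Study Notes: Web Crawler in Practice (4): Taipei City Open Data Platform API

# Python Study Notes: Web Crawler in Practice (5): Fetching Weather Data from OpenWeather

# Python Study Notes: Web Crawler in Practice (6): 博客來 Bookstore Daily Book (34% Off) Page

# Python Study Notes: Web Crawler in Practice (7): List of TWSE- and TPEx-Listed Companies

# Python Study Notes: Web Crawler in Practice (8): Taiwan Stocks Daily After-Hours Information Page

# Python Study Notes: Web Crawler in Practice (9): Borrowing Records in the City Library Personal Account (Part 1)

# Python Study Notes: Web Crawler in Practice (9): Borrowing Records in the City Library Personal Account (Part 2)

# Python Study Notes: Web Crawler in Practice (9): Borrowing Records in the City Library Personal Account (Part 3)

# Python Study Notes: Web Crawler in Practice (10): NKUST Library Crawler

# Python Study Notes: Selenium 4 Usage (Part 1)

# Python Study Notes: Selenium 4 Usage (Part 2)

# Python Study Notes: Selenium 4 Usage (Part 3)

# Python Study Notes: Web Crawler in Practice (11): TDCC Shareholding Distribution Table

# Python Study Notes: Web Crawler in Practice (12): NDC Business Cycle Monitoring Indicator

# Python Study Notes: Web Crawler in Practice (13): FTSE China A50 Futures Index

# Python Study Notes: Web Crawler in Practice (14): Paginated Search Results on 104 Job Bank

# Python Study Notes: Web Crawler in Practice (15): Paginated NBA Player Data

# Python Study Notes: Web Crawler in Practice (16): Paginated Book Data on books.toscrape.com

# Python Study Notes: Using the Scrapy Web Crawling Framework (1)

# Python Study Notes: Using the Scrapy Web Crawling Framework (2)

# Python Study Notes: Using the Scrapy Web Crawling Framework (3)

# Python Study Notes: Using the Scrapy Web Crawling Framework (4)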

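X. Processing items with an Item Pipeline:

In the previous test, the spider's parse() method merely stored the scraped data in item objects (that is, it just collected the data). To clean or process the field data in an Item object, use an Item Pipeline (the idea of modular functionality).

In Scrapy the Item Pipeline is the component responsible for processing data. Concretely it is the pipelines.py file in the second-level project directory, which is generated automatically when a new project is created. However, the Item Pipeline is optional: even after item-processing code has been written in pipelines.py, the pipeline classes must still be added to settings.py, together with their execution order, before processing is enabled.

1. An Item Pipeline for data cleaning:

This post again uses the 19 currency rates on the Bank of Taiwan posted exchange rate site as the example. First create a project named project4: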

scrapy startproject project4

D:\python\test\scrapy_projects>scrapy startproject project4

New Scrapy project 'project4', using template directory 'C:\Users\tony1\AppData\Local\Programs\Thonny\Lib\site-packages\scrapy\templates\project', created in:

D:\python\test\scrapy_projects\project4

You can start your first spider with:

cd project4

scrapy genspider example example.com

Then copy items.py from the project3 project into project4, overwriting the default items.py. The content is unchanged:

# items.py
import scrapy

class RateItem(scrapy.Item):
    currency = scrapy.Field()
    rate = scrapy.Field()

Next copy the spider bot_rate_spider.py from project3's spiders directory into project4's spiders directory and change it to import the RateItem class from project4.items. Everything else stays exactly the same:

# bot_rate_spider.py
import scrapy
from project4.items import RateItem  # must be changed to project4 here

class RateSpider(scrapy.Spider):
    name = 'bot_rate_spider'
    allowed_domains = ['rate.bot.com.tw']
    start_urls = ['https://rate.bot.com.tw/xrt?Lang=zh-TW']

    def parse(self, response):
        xpath = '//tbody/tr/td/div/div[position()=2]/text()'
        currency = response.xpath(xpath).getall()
        currency = [c.strip() for c in currency]
        xpath = '//tbody/tr/td[position()=3]/text()'
        rate = response.xpath(xpath).getall()
        rate_dict = {c: r for c, r in zip(currency, rate)}
        for c, r in rate_dict.items():
            rate_item = RateItem()
            rate_item['currency'] = c
            rate_item['rate'] = r
            yield rate_item

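This effectively copies project3 into project4, and the two behave identically. This post adds Item Pipeline functionality on top of that.

First open pipelines.py in the second-level project directory. Its content was generated when the project was created: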

# pipelines.py
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class Project4Pipeline:
    def process_item(self, item, spider):
        return item

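As shown, there is already a default class named Project4Pipeline (the class name can be changed freely). It has a process_item() method, which is the function that processes items. It receives three parameters: the object itself (self), the item (each Item object yielded by the spider), and the running spider instance (spider), which can be used to perform different operations or checks for different spiders.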
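As a minimal sketch of how the spider argument might be used to branch, the hypothetical pipeline below tags each item according to spider.name. The class name TagBySpider, the 'source' key and the fake spider object are illustrative assumptions, and a plain dict stands in for the item so the sketch runs without Scrapy (with RateItem, a source field would have to be declared first).

```python
from types import SimpleNamespace

class TagBySpider:
    """Hypothetical pipeline: adds a 'source' key depending on which spider ran."""

    def process_item(self, item, spider):
        # spider.name is the string assigned to the spider's `name` attribute
        if spider.name == 'bot_rate_spider':
            item['source'] = 'Bank of Taiwan'
        else:
            item['source'] = 'unknown'
        return item  # always hand the item on to the next pipeline

# Stand-ins so the sketch runs without Scrapy installed
fake_spider = SimpleNamespace(name='bot_rate_spider')
other_spider = SimpleNamespace(name='some_other_spider')

result = TagBySpider().process_item({'currency': '美金 (USD)', 'rate': '33'}, fake_spider)
other = TagBySpider().process_item({'currency': '美金 (USD)', 'rate': '33'}, other_spider)
print(result['source'], other['source'])
```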

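Note that pipelines.py imports the itemadapter.ItemAdapter class by default. This adapter class wraps an Item object in a dict-like ItemAdapter object so that the Item's contents can be handled like a Python dictionary, for example: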

>>> import scrapy
>>> from itemadapter import ItemAdapter
>>> class MyItem(scrapy.Item):
        name=scrapy.Field()
        price=scrapy.Field()

>>> item=MyItem(name='Apple', price=10)
>>> adapter=ItemAdapter(item)
>>> type(adapter)

>>> dir(adapter)

['ADAPTER_CLASSES', '_MutableMapping__marker', '__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_asdict', '_get_adapter_class', 'adapter', 'asdict', 'clear', 'field_names', 'get', 'get_field_meta', 'get_field_meta_from_class', 'get_field_names_from_class', 'is_item', 'is_item_class', 'item', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

So ItemAdapter provides the familiar dictionary methods such as keys(), values(), items(), get(), pop() and update():

>>> adapter.keys()

KeysView(<ItemAdapter for MyItem(name='Apple', price=10)>)

>>> adapter.values()

ValuesView(<ItemAdapter for MyItem(name='Apple', price=10)>)

>>> adapter.items()

ItemsView(<ItemAdapter for MyItem(name='Apple', price=10)>)
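The KeysView(...), ValuesView(...) and ItemsView(...) results above are lazy view objects of the kind produced by collections.abc.MutableMapping (the dir() listing includes _MutableMapping__marker and _abc_impl, which suggests ItemAdapter is built on it). The sketch below uses a hypothetical stand-in class, DictLikeWrapper, not itemadapter's actual implementation, to show how such views can be turned into ordinary lists and dicts.

```python
from collections.abc import MutableMapping

class DictLikeWrapper(MutableMapping):
    """Hypothetical stand-in for an adapter; NOT itemadapter's real code."""

    def __init__(self, data):
        self._data = data

    # The five abstract methods MutableMapping requires; keys(), values(),
    # items(), get(), pop(), update() etc. are then provided automatically.
    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

wrapper = DictLikeWrapper({'name': 'Apple', 'price': 10})
view_type = type(wrapper.keys()).__name__   # a view object, not a list
keys = list(wrapper.keys())                 # materialize the view
values = list(wrapper.values())
pairs = dict(wrapper.items())
wrapper['price'] = 12                       # dict-style assignment also works
print(view_type, keys, values, pairs, wrapper['price'])
```

With a real ItemAdapter the same list(...) and dict(...) calls should apply, since it exposes the same mapping methods.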

The itemadapter module is installed together with Scrapy, so it can be imported directly. See:

# https://pypi.org/project/itemadapter/0.0.3/

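In this test we first process the currency field of each item by removing the parentheses around the English currency code, so the class can be renamed RemoveParentheses. The cleaning is implemented in process_item(): removing the left and right parentheses only takes two chained calls to the string replace() method, and the result is then used to update the Item's currency field. The modified pipelines.py is as follows:

# pipelines.py
from itemadapter import ItemAdapter

class RemoveParentheses:  # custom class name
    def process_item(self, item, spider):
        currency = item['currency'].replace('(', '').replace(')', '')  # strip parentheses
        item['currency'] = currency  # update the currency field
        return item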
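The chained replace() calls can be checked in isolation. In the sketch below the helper name remove_parentheses is hypothetical, and the sample strings assume the raw scraped form is something like '美金 (USD)', which is inferred from the cleaned output rather than taken from the page itself.

```python
def remove_parentheses(text):
    """Drop every '(' and ')' from the string, leaving other characters intact."""
    return text.replace('(', '').replace(')', '')

# Assumed raw values as they might come out of the spider before cleaning
samples = ['美金 (USD)', '日圓 (JPY)']
cleaned = [remove_parentheses(s) for s in samples]
print(cleaned)
```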

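Note that a project can define several Item Pipeline classes for different processing jobs, but each takes effect only after it is registered in the ITEM_PIPELINES dictionary in settings.py: add an entry whose key is the class's import path (for example project4.pipelines.RemoveParentheses) and whose value is its execution order. Only then is the pipeline class in pipelines.py run.

In a newly created project the Item Pipeline feature is disabled by default. Open settings.py and search for 'Pipeline' to find the ITEM_PIPELINES setting, which is commented out by default:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "project4.pipelines.Project4Pipeline": 300,
#}

Remove the # in front of the ITEM_PIPELINES dictionary, change the class name to the RemoveParentheses class defined in pipelines.py, and set its execution order to enable the Item Pipeline, as follows: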

ITEM_PIPELINES = {

"project4.pipelines.RemoveParentheses": 300,

}

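The value is by convention an integer in the range 0 to 1000. When pipelines.py defines several processing classes, ITEM_PIPELINES must contain one key-value pair for each, and the order in which they run is determined by these values: classes with smaller values run first.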
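To make the ordering rule concrete, the sketch below sorts a sample ITEM_PIPELINES dictionary by its values, which mirrors the "smaller value runs first" rule. The entry project4.pipelines.AnotherStep is a hypothetical second pipeline added only for illustration, and the sorting here merely imitates the rule; it is not Scrapy's internal code.

```python
# Sample setting; AnotherStep is a hypothetical pipeline used only for this demo
ITEM_PIPELINES = {
    "project4.pipelines.RemoveParentheses": 300,
    "project4.pipelines.AnotherStep": 100,
}

# Sort the class paths by their order value: smaller values come first
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(execution_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

The position in the dictionary literal does not matter; only the numbers do.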

Now the spider can be run:

scrapy crawl bot_rate_spider -o data.json

D:\python\test\scrapy_projects\project4>scrapy crawl bot_rate_spider -o data.json

2024-07-19 19:46:35 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project4)

2024-07-19 19:46:35 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0

2024-07-19 19:46:35 [scrapy.addons] INFO: Enabled addons:

[]

2024-07-19 19:46:35 [asyncio] DEBUG: Using selector: SelectSelector

2024-07-19 19:46:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor

2024-07-19 19:46:35 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop

2024-07-19 19:46:35 [scrapy.extensions.telnet] INFO: Telnet Password: 4e7f47f963c14a9a

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats',

'scrapy.extensions.telnet.TelnetConsole',

'scrapy.extensions.feedexport.FeedExporter',

'scrapy.extensions.logstats.LogStats']

2024-07-19 19:46:35 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'project4',

'FEED_EXPORT_ENCODING': 'utf-8',

'NEWSPIDER_MODULE': 'project4.spiders',

'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',

'ROBOTSTXT_OBEY': True,

'SPIDER_MODULES': ['project4.spiders'],

'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',

'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

'scrapy.downloadermiddlewares.retry.RetryMiddleware',

'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',

'scrapy.downloadermiddlewares.stats.DownloaderStats']

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

'scrapy.spidermiddlewares.referer.RefererMiddleware',

'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

'scrapy.spidermiddlewares.depth.DepthMiddleware']

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled item pipelines:

['project4.pipelines.RemoveParentheses']

2024-07-19 19:46:35 [scrapy.core.engine] INFO: Spider opened

2024-07-19 19:46:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2024-07-19 19:46:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2024-07-19 19:46:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)

2024-07-19 19:46:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '美金 USD', 'rate': '33'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '港幣 HKD', 'rate': '4.239'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '英鎊 GBP', 'rate': '43.17'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '澳幣 AUD', 'rate': '22.3'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '加拿大幣 CAD', 'rate': '24.29'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '新加坡幣 SGD', 'rate': '24.68'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '瑞士法郎 CHF', 'rate': '37.21'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '日圓 JPY', 'rate': '0.2115'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '南非幣 ZAR', 'rate': '-'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '瑞典幣 SEK', 'rate': '3.21'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '紐元 NZD', 'rate': '20.08'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '泰幣 THB', 'rate': '0.9636'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '菲國比索 PHP', 'rate': '0.6271'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '印尼幣 IDR', 'rate': '0.00238'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '歐元 EUR', 'rate': '36.15'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '韓元 KRW', 'rate': '0.02574'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '越南盾 VND', 'rate': '0.00146'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '馬來幣 MYR', 'rate': '7.482'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '人民幣 CNY', 'rate': '4.562'}

2024-07-19 19:46:35 [scrapy.core.engine] INFO: Closing spider (finished)

2024-07-19 19:46:35 [scrapy.extensions.feedexport] INFO: Stored json feed (19 items) in: data.json

2024-07-19 19:46:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 745,

'downloader/request_count': 2,

'downloader/request_method_count/GET': 2,

'downloader/response_bytes': 137734,

'downloader/response_count': 2,

'downloader/response_status_count/200': 2,

'elapsed_time_seconds': 0.39827,

'feedexport/success_count/FileFeedStorage': 1,

'finish_reason': 'finished',

'finish_time': datetime.datetime(2024, 7, 19, 11, 46, 35, 833850, tzinfo=datetime.timezone.utc),

'item_scraped_count': 19,

'log_count/DEBUG': 24,

'log_count/INFO': 11,

'response_received_count': 2,

'robotstxt/request_count': 1,

'robotstxt/response_count': 1,

'robotstxt/response_status_count/200': 1,

'scheduler/dequeued': 1,

'scheduler/dequeued/memory': 1,

'scheduler/enqueued': 1,

'scheduler/enqueued/memory': 1,

'start_time': datetime.datetime(2024, 7, 19, 11, 46, 35, 435580, tzinfo=datetime.timezone.utc)}

2024-07-19 19:46:35 [scrapy.core.engine] INFO: Spider closed (finished)

The output file data.json contains:

[

{"currency": "美金 USD", "rate": "33"},

{"currency": "港幣 HKD", "rate": "4.239"},

{"currency": "英鎊 GBP", "rate": "43.17"},

{"currency": "澳幣 AUD", "rate": "22.3"},

{"currency": "加拿大幣 CAD", "rate": "24.28"},

{"currency": "新加坡幣 SGD", "rate": "24.68"},

{"currency": "瑞士法郎 CHF", "rate": "37.21"},

{"currency": "日圓 JPY", "rate": "0.2115"},

{"currency": "南非幣 ZAR", "rate": "-"},

{"currency": "瑞典幣 SEK", "rate": "3.2"},

{"currency": "紐元 NZD", "rate": "20.08"},

{"currency": "泰幣 THB", "rate": "0.9631"},

{"currency": "菲國比索 PHP", "rate": "0.6271"},

{"currency": "印尼幣 IDR", "rate": "0.00238"},

{"currency": "歐元 EUR", "rate": "36.15"},

{"currency": "韓元 KRW", "rate": "0.02573"},

{"currency": "越南盾 VND", "rate": "0.00146"},

{"currency": "馬來幣 MYR", "rate": "7.482"},

{"currency": "人民幣 CNY", "rate": "4.56"}

]

The parentheses in the currency field have been removed by the Item Pipeline.

The project archive for the test above can be downloaded from GitHub:

# https://github.com/tony1966/tony1966.github.io/blob/master/test/python/web_crawler/scrapy_project4_1.zip

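2. An Item Pipeline that writes to a database (approach 1):

Building on the project4 project above, after the target data has been scraped and the cleaning pipeline has removed the parentheses from the currency field, we want to store the items in an SQLite database. This requires adding another class to pipelines.py. Besides process_item(), it also defines open_spider() and close_spider(), with these roles:

open_spider() : connect to the database & create the table
process_item() : process the item & execute the SQL statement
close_spider() : commit the database transaction & close the connection

Working with SQLite requires importing Python's built-in sqlite3 module. SQLite is a lightweight relational database that, like Microsoft Access, keeps the database in a single file, so backup, archiving and transfer are very easy, and it runs without a server connection. For usage see:

# Python Study Notes: Database Access Tests (1) SQLite

Modify pipelines.py in the second-level project directory project4: import sqlite3 first, then add a pipeline class Save2SQLite that defines the three methods above: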

# pipelines.py
from itemadapter import ItemAdapter
import sqlite3

class RemoveParentheses:
    def process_item(self, item, spider):
        currency = item['currency'].replace('(', '').replace(')', '')  # strip both parentheses
        item['currency'] = currency  # update the currency field
        return item  # pass the item on

class Save2SQLite:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('bot_rate_spider.sqlite')  # open (or create) the database file
        self.cur = self.conn.cursor()  # create a Cursor object
        SQL = 'CREATE TABLE if not exists bot_rate(' +\
              'currency TEXT, rate TEXT)'  # create table bot_rate if it does not exist
        self.cur.execute(SQL)  # create the table

    def process_item(self, item, spider):
        SQL = 'INSERT INTO bot_rate(currency, rate) VALUES("' +\
              item['currency'] + '","' + item['rate'] + '")'  # SQL INSERT statement
        self.cur.execute(SQL)  # insert the record
        return item  # pass the item on

    def close_spider(self, spider):
        self.conn.commit()  # commit the transaction
        self.conn.close()  # close the database connection

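Note that because the currency and rate columns are both of type TEXT, the values must be wrapped in quotation marks inside the SQL statement, otherwise an error occurs at run time. Also, open_spider() names the database file bot_rate_spider.sqlite, but this file does not have to be created by hand beforehand: when the spider runs, the .sqlite file is created automatically in the first-level project directory project4 (the directory from which scrapy crawl is run).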
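The call order of the three hooks can be tried without running a crawl. The sketch below is an assumption-laden stand-in, not the project's code: the class Save2Memory, the ':memory:' database and the fake spider are all hypothetical, and the hooks are invoked by hand in the order open_spider(), process_item() per item, close_spider(). It quotes the values with single quotes, the standard SQL string delimiter.

```python
import sqlite3
from types import SimpleNamespace

class Save2Memory:
    """Hypothetical variant of the pipeline that writes to an in-memory database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect(':memory:')  # nothing is written to disk
        self.cur = self.conn.cursor()
        self.cur.execute('CREATE TABLE IF NOT EXISTS bot_rate(currency TEXT, rate TEXT)')

    def process_item(self, item, spider):
        # TEXT values are wrapped in single quotes inside the statement
        sql = ("INSERT INTO bot_rate(currency, rate) VALUES('" +
               item['currency'] + "','" + item['rate'] + "')")
        self.cur.execute(sql)
        return item

    def close_spider(self, spider):
        self.conn.commit()
        # Read the rows back before closing so the demo can show them
        self.rows = self.conn.execute('SELECT currency, rate FROM bot_rate').fetchall()
        self.conn.close()

# Drive the hooks by hand, in the order described above
spider = SimpleNamespace(name='bot_rate_spider')
pipeline = Save2Memory()
pipeline.open_spider(spider)
for item in [{'currency': '美金 USD', 'rate': '33'},
             {'currency': '港幣 HKD', 'rate': '4.239'}]:
    pipeline.process_item(item, spider)
pipeline.close_spider(spider)
print(pipeline.rows)
```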

Next, edit the ITEM_PIPELINES dictionary in settings.py and add a key-value pair for the new Save2SQLite class and its order number:

# settings.py
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "project4.pipelines.RemoveParentheses": 300,  # runs first
    "project4.pipelines.Save2SQLite": 400,  # runs second
}

Order values range from 0 to 1000 by convention. Save2SQLite has a larger value than RemoveParentheses, so it runs later: RemoveParentheses strips the parentheses first, and the item is then saved to the table.

The result of the run:

scrapy crawl bot_rate_spider -o data.json

D:\python\test\scrapy_projects\project4>scrapy crawl bot_rate_spider -o data.json

2024-07-20 10:21:37 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project4)

2024-07-20 10:21:37 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0

2024-07-20 10:21:37 [scrapy.addons] INFO: Enabled addons:

[]

2024-07-20 10:21:37 [asyncio] DEBUG: Using selector: SelectSelector

2024-07-20 10:21:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor

2024-07-20 10:21:37 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop

2024-07-20 10:21:37 [scrapy.extensions.telnet] INFO: Telnet Password: 866130966a0be4e8

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats',

'scrapy.extensions.telnet.TelnetConsole',

'scrapy.extensions.feedexport.FeedExporter',

'scrapy.extensions.logstats.LogStats']

2024-07-20 10:21:37 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'project4',

'FEED_EXPORT_ENCODING': 'utf-8',

'NEWSPIDER_MODULE': 'project4.spiders',

'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',

'ROBOTSTXT_OBEY': True,

'SPIDER_MODULES': ['project4.spiders'],

'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',

'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

'scrapy.downloadermiddlewares.retry.RetryMiddleware',

'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',

'scrapy.downloadermiddlewares.stats.DownloaderStats']

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

'scrapy.spidermiddlewares.referer.RefererMiddleware',

'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

'scrapy.spidermiddlewares.depth.DepthMiddleware']

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled item pipelines:

['project4.pipelines.RemoveParentheses', 'project4.pipelines.Save2SQLite']

2024-07-20 10:21:37 [scrapy.core.engine] INFO: Spider opened

2024-07-20 10:21:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2024-07-20 10:21:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2024-07-20 10:21:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)

2024-07-20 10:21:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '美金 USD', 'rate': '33'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '港幣 HKD', 'rate': '4.239'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '英鎊 GBP', 'rate': '43.23'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '澳幣 AUD', 'rate': '22.27'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '加拿大幣 CAD', 'rate': '24.26'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '新加坡幣 SGD', 'rate': '24.66'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '瑞士法郎 CHF', 'rate': '37.2'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '日圓 JPY', 'rate': '0.2115'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '南非幣 ZAR', 'rate': '-'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '瑞典幣 SEK', 'rate': '3.2'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '紐元 NZD', 'rate': '20.04'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '泰幣 THB', 'rate': '0.9618'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '菲國比索 PHP', 'rate': '0.6266'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '印尼幣 IDR', 'rate': '0.00238'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '歐元 EUR', 'rate': '36.14'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '韓元 KRW', 'rate': '0.02571'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '越南盾 VND', 'rate': '0.00146'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '馬來幣 MYR', 'rate': '7.482'}

2024-07-20 10:21:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '人民幣 CNY', 'rate': '4.559'}

2024-07-20 10:21:38 [scrapy.core.engine] INFO: Closing spider (finished)

2024-07-20 10:21:38 [scrapy.extensions.feedexport] INFO: Stored json feed (19 items) in: data.json

2024-07-20 10:21:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 745,

'downloader/request_count': 2,

'downloader/request_method_count/GET': 2,

'downloader/response_bytes': 137732,

'downloader/response_count': 2,

'downloader/response_status_count/200': 2,

'elapsed_time_seconds': 0.455029,

'feedexport/success_count/FileFeedStorage': 1,

'finish_reason': 'finished',

'finish_time': datetime.datetime(2024, 7, 20, 2, 21, 38, 41126, tzinfo=datetime.timezone.utc),

'item_scraped_count': 19,

'log_count/DEBUG': 24,

'log_count/INFO': 11,

'response_received_count': 2,

'robotstxt/request_count': 1,

'robotstxt/response_count': 1,

'robotstxt/response_status_count/200': 1,

'scheduler/dequeued': 1,

'scheduler/dequeued/memory': 1,

'scheduler/enqueued': 1,

'scheduler/enqueued/memory': 1,

'start_time': datetime.datetime(2024, 7, 20, 2, 21, 37, 586097, tzinfo=datetime.timezone.utc)}

2024-07-20 10:21:38 [scrapy.core.engine] INFO: Spider closed (finished)

After the run, a bot_rate_spider.sqlite file appears in the first-level project directory project4. It can be uploaded to the online tool SQLite Viewer to inspect its contents:

# https://inloop.github.io/sqlite-viewer/

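The Save2SQLite pipeline has indeed written all the items into the SQLite database. Note that none of the currency values contain parentheses, because the RemoveParentheses pipeline removed them. The DB Browser for SQLite application can also be downloaded to manage and view SQLite files; see: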

# Python Study Notes: DB Browser for SQLite

The project4 archive for the test above can be downloaded from GitHub:

# https://github.com/tony1966/tony1966.github.io/blob/master/test/python/web_crawler/scrapy_project4_2.zip

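3. An Item Pipeline that writes to a database (approach 2):

In the pipelines.py above, the database file name is hard-coded in open_spider(). It can instead be stored in a custom setting in the project settings file and read in open_spider() with spider.settings.get(). Then changing the database file name only requires editing settings.py, not pipelines.py.

First add a custom setting for the database file name to settings.py, for example SQLITE_DATABASE: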

ITEM_PIPELINES = {
    "project4.pipelines.RemoveParentheses": 300,
    "project4.pipelines.Save2SQLite": 400,
}

SQLITE_DATABASE = 'spider_db.sqlite'  # custom setting

Then change open_spider() in the Save2SQLite class so that, instead of hard-coding the database file name, it reads the value of the custom setting SQLITE_DATABASE from settings.py. The modified open_spider() method:

    def open_spider(self, spider):
        db = spider.settings.get('SQLITE_DATABASE')
        self.conn = sqlite3.connect(db)
        self.cur = self.conn.cursor()
        SQL = 'CREATE TABLE if not exists bot_rate(' +\
              'currency TEXT, rate TEXT)'
        self.cur.execute(SQL)

In this way settings such as the database host address and login credentials can be managed in one place in settings.py and retrieved in pipelines.py with spider.settings.get().
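If SQLITE_DATABASE is missing from settings.py, get() returns None by default, and connecting with None would likely fail. Scrapy's settings get() accepts a second argument as a fallback, so one possible guard is sketched below. The helper resolve_db_name and the DEFAULT_DB constant are hypothetical, and plain dicts stand in for the Scrapy settings object (dict.get takes the same name/default arguments) so the sketch runs without Scrapy.

```python
from types import SimpleNamespace

DEFAULT_DB = 'spider_db.sqlite'  # hypothetical fallback file name

def resolve_db_name(spider):
    """Read SQLITE_DATABASE from the spider's settings, falling back to a default."""
    return spider.settings.get('SQLITE_DATABASE', DEFAULT_DB)

# Stand-ins for spiders whose settings do or do not define the custom setting
with_setting = SimpleNamespace(settings={'SQLITE_DATABASE': 'rates.sqlite'})
without_setting = SimpleNamespace(settings={})

print(resolve_db_name(with_setting))
print(resolve_db_name(without_setting))
```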

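In addition, in the book "Scrapy 一本就精通" I saw another way of writing to an SQLite table. Instead of building a complete INSERT INTO statement that already contains the values, put the values to insert in a tuple, mark them in the statement with ? placeholders, and pass the tuple when calling execute(). The modified process_item() method:

    def process_item(self, item, spider):
        VALUES = (item['currency'], item['rate'])
        SQL = 'INSERT INTO bot_rate VALUES(?, ?)'
        self.cur.execute(SQL, VALUES)
        return item

The complete pipelines.py in the new style: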

from itemadapter import ItemAdapter
import sqlite3

class RemoveParentheses:
    def process_item(self, item, spider):
        currency = item['currency'].replace('(', '').replace(')', '')
        item['currency'] = currency
        return item

class Save2SQLite:
    def open_spider(self, spider):
        db = spider.settings.get('SQLITE_DATABASE')  # read the value set in settings.py
        self.conn = sqlite3.connect(db)
        self.cur = self.conn.cursor()
        SQL = 'CREATE TABLE if not exists bot_rate(' +\
              'currency TEXT, rate TEXT)'
        self.cur.execute(SQL)

    def process_item(self, item, spider):
        VALUES = (item['currency'], item['rate'])  # values to insert go in a tuple
        SQL = 'INSERT INTO bot_rate VALUES(?, ?)'  # each value is marked with ?
        self.cur.execute(SQL, VALUES)
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

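Running the project again produces the database file spider_db.sqlite in the first-level project directory project4, with exactly the same contents as before. I find this approach convenient because the quoting of string fields no longer has to be handled by hand.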
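The difference between the two styles shows up as soon as a value itself contains a quote character. The sketch below uses an in-memory database and a made-up value, "Trader's rate" (not something the Bank of Taiwan page returns): the string-concatenated statement breaks on the apostrophe, while the ? placeholder version stores the value unchanged.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bot_rate(currency TEXT, rate TEXT)')

tricky = ("Trader's rate", '1.0')  # hypothetical value containing an apostrophe

# Style 1: build the whole statement by string concatenation
try:
    conn.execute("INSERT INTO bot_rate VALUES('" + tricky[0] + "','" + tricky[1] + "')")
    concat_ok = True
except sqlite3.Error:
    concat_ok = False  # the stray apostrophe ends the string literal early

# Style 2: let sqlite3 bind the values through ? placeholders
conn.execute('INSERT INTO bot_rate VALUES(?, ?)', tricky)
rows = conn.execute('SELECT currency, rate FROM bot_rate').fetchall()
conn.close()

print(concat_ok, rows)
```

Placeholders also keep scraped text from being interpreted as SQL, which is a further reason to prefer them for data that comes from web pages.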

Download link for this version of the project archive:

# https://github.com/tony1966/tony1966.github.io/blob/master/test/python/web_crawler/scrapy_project4_3.zip

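Python Study Notes: Using the Scrapy Web Crawling Framework (4)

In the earlier tests we used Scrapy to build a web crawler that scrapes the Bank of Taiwan posted exchange rate site, but the scraped data was simply returned or exported as dictionaries. That completes the basic crawling task but does not take full advantage of the Scrapy project architecture. Web pages provide unstructured data, so besides scraping, a crawler also has to convert the data into structured form and store it in a database, because structured data is best suited to statistical analysis and most data science analysis is done on structured data.

In the earlier Scrapy tests the -o option saved the scraped data (dictionaries) to a JSON file. Dictionaries and JSON data are a kind of structured data too, but the data fields are not clearly visible in the program, which hurts readability, and dictionaries and JSON have no field-name checking, so mistakes slip in easily. Scrapy therefore provides the Item object to encapsulate data fields, and combined with pipelines it makes saving to a relational database (such as SQLite) more convenient.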

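Earlier notes in this series:

# Python Study Notes: Browser Automation Testing with the Selenium Module (1)

# Python Study Notes: Browser Automation Testing with the Selenium Module (2)

# Python Study Notes: Web Scraping (1): Using urllib and HTMLParser

# Python Study Notes: Web Scraping (2): Downloading Pages with the requests Package

# Python Study Notes: Web Scraping (3): Parsing Pages with BeautifulSoup

# Python Study Notes: Web Scraping (4): Steps and Tools for Developing a Web Crawler

# Python Study Notes: Web Scraping (5): Installing the Chrome Extension Quick Javascript Switcher

# Python Study Notes: Web Crawler in Practice (1): Bank of Taiwan Posted Exchange Rates

# Python Study Notes: Web Crawler in Practice (2): BBC Business and Financial News

# Python Study Notes: Web Crawler in Practice (3): TWSE Market Holidays

# Python Study Notes: Web Crawler in Practice (4): Taipei City Open Data Platform API

# Python Study Notes: Web Crawler in Practice (5): Fetching Weather Data from OpenWeather

# Python Study Notes: Web Crawler in Practice (6): 博客來 Bookstore Daily Book (34% Off) Page

# Python Study Notes: Web Crawler in Practice (7): List of TWSE- and TPEx-Listed Companies

# Python Study Notes: Web Crawler in Practice (8): Taiwan Stocks Daily After-Hours Information Page

# Python Study Notes: Web Crawler in Practice (9): Borrowing Records in the City Library Personal Account (Part 1)

# Python Study Notes: Web Crawler in Practice (9): Borrowing Records in the City Library Personal Account (Part 2)

# Python Study Notes: Web Crawler in Practice (9): Borrowing Records in the City Library Personal Account (Part 3)

# Python Study Notes: Web Crawler in Practice (10): NKUST Library Crawler

# Python Study Notes: Selenium 4 Usage (Part 1)

# Python Study Notes: Selenium 4 Usage (Part 2)

# Python Study Notes: Selenium 4 Usage (Part 3)

# Python Study Notes: Web Crawler in Practice (11): TDCC Shareholding Distribution Table

# Python Study Notes: Web Crawler in Practice (12): NDC Business Cycle Monitoring Indicator

# Python Study Notes: Web Crawler in Practice (13): FTSE China A50 Futures Index

# Python Study Notes: Web Crawler in Practice (14): Paginated Search Results on 104 Job Bank

# Python Study Notes: Web Crawler in Practice (15): Paginated NBA Player Data

# Python Study Notes: Web Crawler in Practice (16): Paginated Book Data on books.toscrape.com

# Python Study Notes: Using the Scrapy Web Crawling Framework (1)

# Python Study Notes: Using the Scrapy Web Crawling Framework (2)

# Python Study Notes: Using the Scrapy Web Crawling Framework (3)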

9. Defining structured data items with Item objects:

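Scrapy encapsulates structured data mainly through the Item and Field classes: Item defines a data item and Field defines a data field. When a project is created, the items.py file in the second-level project directory is where these structured fields are defined. This post again uses the Bank of Taiwan posted foreign exchange rate page as the example to show how to encapsulate structured data items with Scrapy's Item and Field.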

First create a new project, project3:

scrapy startproject project3

D:\python\test\scrapy_projects>scrapy startproject project3

New Scrapy project 'project3', using template directory 'C:\Users\tony1\AppData\Local\Programs\Thonny\Lib\site-packages\scrapy\templates\project', created in:

D:\python\test\scrapy_projects\project3

You can start your first spider with:

cd project3

scrapy genspider example example.com

Once the project is created, there is an items.py file in the second-level project directory project3:

D:\python\test\scrapy_projects>cd project3

D:\python\test\scrapy_projects\project3>tree project3 /f

Folder PATH listing for volume 新增磁碟區
Volume serial number is 1258-16B8
D:\PYTHON\TEST\SCRAPY_PROJECTS\PROJECT3\PROJECT3
│  items.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
└─spiders
        __init__.py

The default content of items.py is:

# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class Project3Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

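As you can see, the default is just an empty subclass of scrapy.Item named Project3Item, where the Project3 prefix comes from the project name. The class name can be changed freely; in this example it is renamed RateItem.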

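To encapsulate the scraped data in structured items, edit items.py and use the scrapy.Item and scrapy.Field classes to define the fields of a data item. For the 19 currencies on the Bank of Taiwan board this yields 19 items, one per currency, all sharing the same fields. From the earlier tests, the list of these 19 currencies, currency, is: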

>>> print(currency)

['美金 (USD)', '港幣 (HKD)', '英鎊 (GBP)', '澳幣 (AUD)', '加拿大幣 (CAD)', '新加坡幣 (SGD)', '瑞士法郎 (CHF)', '日圓 (JPY)', '南非幣 (ZAR)', '瑞典幣 (SEK)', '紐元 (NZD)', '泰幣 (THB)', '菲國比索 (PHP)', '印尼幣 (IDR)', '歐元 (EUR)', '韓元 (KRW)', '越南盾 (VND)', '馬來幣 (MYR)', '人民幣 (CNY)']

and the list of rates looks like this, for example:

>>> print(rate)

['32.895', '4.229', '43.24', '22.35', '24.27', '24.61', '36.8', '0.2093', '-', '3.21', '20.13', '0.9657', '0.6246', '0.00238', '36.09', '0.02573', '0.00145', '7.477', '4.542']

To store these 19 rate entries in structured form, define two fields, currency and rate, in items.py. This is done by subclassing the Item class and declaring the two fields with the Field class:

# items.py
import scrapy

class RateItem(scrapy.Item):
    currency=scrapy.Field()    # field that stores the currency
    rate=scrapy.Field()        # field that stores the rate

A second way to write it:

# items.py
from scrapy import Item, Field

class RateItem(Item):
    currency=Field()    # field that stores the currency
    rate=Field()        # field that stores the rate

First try out the Item object in the interactive shell:

>>> import scrapy
>>> class RateItem(scrapy.Item):    # define a subclass of Item for the data item
    currency=scrapy.Field()
    rate=scrapy.Field()

Call the constructor to create an item instance and fill it in:

>>> rate1=RateItem()    # create an item instance (object)

>>> type(rate1)

<class '__main__.RateItem'>

>>> rate1['currency']='美金 (USD)'

>>> rate1['rate']='32.895'

>>> rate1

{'currency': '美金 (USD)', 'rate': '32.895'}

It looks just like a dictionary. Use dir() to inspect this RateItem object:

>>> dir(rate1)

['_MutableMapping__marker', '__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_class', '_values', 'clear', 'copy', 'deepcopy', 'fields', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

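So an Item object behaves much like a dictionary: the _MutableMapping__marker entry shows that it implements the MutableMapping interface, which gives it the familiar dictionary methods (it also has a few extras such as fields and deepcopy). For example: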

>>> rate1.keys()

dict_keys(['currency', 'rate'])

>>> rate1.values()

ValuesView({'currency': '美金 (USD)', 'rate': '32.895'})

>>> rate1.items()

ItemsView({'currency': '美金 (USD)', 'rate': '32.895'})

>>> rate1.get('currency')

'美金 (USD)'

>>> rate1.get('rate')

'32.895'
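The practical difference from a plain dictionary is that an Item only accepts the fields declared on its class. The sketch below imitates that behaviour with the standard library's MutableMapping so it runs without Scrapy; FieldCheckedItem is a hypothetical, simplified stand-in written for this note and is not Scrapy's actual implementation.

```python
from collections.abc import MutableMapping


class FieldCheckedItem(MutableMapping):
    """Hypothetical, simplified stand-in for scrapy.Item (not Scrapy's code)."""

    fields = ('currency', 'rate')    # the declared field names

    def __init__(self):
        self._values = {}

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        # Reject any key that was not declared as a field.
        if key not in self.fields:
            raise KeyError(f'{type(self).__name__} does not support field: {key}')
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


item = FieldCheckedItem()
item['currency'] = '美金 (USD)'
item['rate'] = '32.895'
print(dict(item))           # keys(), values(), items(), get() come from MutableMapping
print(list(item.keys()))

try:
    item['curency'] = '港幣 (HKD)'    # misspelled field name
    rejected = False
except KeyError:
    rejected = True
print(rejected)
```

With an ordinary dict the misspelled key would be stored silently; this field-name check is what the Item class adds.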

With the RateItem class defined, the spider can instantiate it to encapsulate the data it extracts.

In the first-level project directory project3, create the spider bot_rate_spider.py with the genspider command:

scrapy genspider bot_rate_spider rate.bot.com.tw

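Here bot_rate_spider is the name of the spider file under the spiders directory (the base name without .py is enough), and rate.bot.com.tw is the domain of the Bank of Taiwan posted exchange-rate site: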

D:\python\test\scrapy_projects\project3>scrapy genspider bot_rate_spider rate.bot.com.tw

Created spider 'bot_rate_spider' using template 'basic' in module:

project3.spiders.bot_rate_spider

This command uses the default basic template to generate the spider file bot_rate_spider.py in the spiders directory:

D:\python\test\scrapy_projects\project3>tree project3 /f

Folder PATH listing for volume 新增磁碟區
Volume serial number is 1258-16B8
D:\PYTHON\TEST\SCRAPY_PROJECTS\PROJECT3\PROJECT3
│  items.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├─spiders
│  │  bot_rate_spider.py
│  │  __init__.py
│  │
│  └─__pycache__
│          __init__.cpython-310.pyc
│
└─__pycache__
        settings.cpython-310.pyc
        __init__.cpython-310.pyc

Before modifying the spider bot_rate_spider.py, though, let's first test in the interactive shell how to store the scraped data in Item objects.

Here are the currency and rate lists scraped in the previous post:

>>> currency=['美金 (USD)', '港幣 (HKD)', '英鎊 (GBP)', '澳幣 (AUD)', '加拿大幣 (CAD)', '新加坡幣 (SGD)', '瑞士法郎 (CHF)', '日圓 (JPY)', '南非幣 (ZAR)', '瑞典幣 (SEK)', '紐元 (NZD)', '泰幣 (THB)', '菲國比索 (PHP)', '印尼幣 (IDR)', '歐元 (EUR)', '韓元 (KRW)', '越南盾 (VND)', '馬來幣 (MYR)', '人民幣 (CNY)']

>>> rate=['32.895', '4.229', '43.24', '22.35', '24.27', '24.61', '36.8', '0.2093', '-', '3.21', '20.13', '0.9657', '0.6246', '0.00238', '36.09', '0.02573', '0.00145', '7.477', '4.542']

zip() can pair up the corresponding elements of the two lists as key-value pairs to build a dictionary, so that a loop can walk through both lists in step:

>>> rate_dict={c: r for c, r in zip(currency, rate)}

>>> rate_dict

{'美金 (USD)': '32.895', '港幣 (HKD)': '4.229', '英鎊 (GBP)': '43.24', '澳幣 (AUD)': '22.35', '加拿大幣 (CAD)': '24.27', '新加坡幣 (SGD)': '24.61', '瑞士法郎 (CHF)': '36.8', '日圓 (JPY)': '0.2093', '南非幣 (ZAR)': '-', '瑞典幣 (SEK)': '3.21', '紐元 (NZD)': '20.13', '泰幣 (THB)': '0.9657', '菲國比索 (PHP)': '0.6246', '印尼幣 (IDR)': '0.00238', '歐元 (EUR)': '36.09', '韓元 (KRW)': '0.02573', '越南盾 (VND)': '0.00145', '馬來幣 (MYR)': '7.477', '人民幣 (CNY)': '4.542'}

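When iterating over the dictionary, its items() method unpacks each entry into the key (the currency) and the value (the rate), which makes it easy to write them into separate fields of the data item: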

>>> for c, r in rate_dict.items():
    print(c, r)

美金 (USD) 32.895

港幣 (HKD) 4.229

英鎊 (GBP) 43.24

澳幣 (AUD) 22.35

加拿大幣 (CAD) 24.27

新加坡幣 (SGD) 24.61

瑞士法郎 (CHF) 36.8

日圓 (JPY) 0.2093

南非幣 (ZAR) -

瑞典幣 (SEK) 3.21

紐元 (NZD) 20.13

泰幣 (THB) 0.9657

菲國比索 (PHP) 0.6246

印尼幣 (IDR) 0.00238

歐元 (EUR) 36.09

韓元 (KRW) 0.02573

越南盾 (VND) 0.00145

馬來幣 (MYR) 7.477

人民幣 (CNY) 4.542

This lets us put the unpacked c and r into the currency and rate fields of a RateItem object:

>>> for c, r in rate_dict.items():
    rate_item=RateItem()
    rate_item['currency']=c
    rate_item['rate']=r
    print(rate_item)

{'currency': '美金 (USD)', 'rate': '32.895'}

{'currency': '港幣 (HKD)', 'rate': '4.229'}

{'currency': '英鎊 (GBP)', 'rate': '43.24'}

{'currency': '澳幣 (AUD)', 'rate': '22.35'}

{'currency': '加拿大幣 (CAD)', 'rate': '24.27'}

{'currency': '新加坡幣 (SGD)', 'rate': '24.61'}

{'currency': '瑞士法郎 (CHF)', 'rate': '36.8'}

{'currency': '日圓 (JPY)', 'rate': '0.2093'}

{'currency': '南非幣 (ZAR)', 'rate': '-'}

{'currency': '瑞典幣 (SEK)', 'rate': '3.21'}

{'currency': '紐元 (NZD)', 'rate': '20.13'}

{'currency': '泰幣 (THB)', 'rate': '0.9657'}

{'currency': '菲國比索 (PHP)', 'rate': '0.6246'}

{'currency': '印尼幣 (IDR)', 'rate': '0.00238'}

{'currency': '歐元 (EUR)', 'rate': '36.09'}

{'currency': '韓元 (KRW)', 'rate': '0.02573'}

{'currency': '越南盾 (VND)', 'rate': '0.00145'}

{'currency': '馬來幣 (MYR)', 'rate': '7.477'}

{'currency': '人民幣 (CNY)', 'rate': '4.542'}
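The intermediate dictionary is not strictly required, since zip() already yields the pairs one at a time. As a sketch of that variation, the helper below (pair_up is a hypothetical name, and plain dicts stand in for RateItem so the block runs without Scrapy) pairs the lists directly and first checks that they have the same length, because zip() silently stops at the shorter list:

```python
def pair_up(currency, rate):
    """Yield one {'currency', 'rate'} dict per row; refuse mismatched lists."""
    if len(currency) != len(rate):
        raise ValueError(f'{len(currency)} currencies but {len(rate)} rates')
    for c, r in zip(currency, rate):
        yield {'currency': c, 'rate': r}


# Three sample rows taken from the lists above.
currency = ['美金 (USD)', '港幣 (HKD)', '南非幣 (ZAR)']
rate = ['32.895', '4.229', '-']

items = list(pair_up(currency, rate))
for item in items:
    print(item)

# If one XPath matched fewer cells than the other, the mismatch is reported
# instead of the unmatched rows being dropped without notice.
try:
    list(pair_up(currency, rate[:2]))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

A length check like this is a cheap guard against the page layout changing so that the two XPath queries no longer line up.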

Finally, modify the spider based on the test results above:

# bot_rate_spider.py
import scrapy
from project3.items import RateItem    # a relative import, ..items, also works

class RateSpider(scrapy.Spider):
    name='bot_rate_spider'
    allowed_domains=['rate.bot.com.tw']
    start_urls=['https://rate.bot.com.tw/xrt?Lang=zh-TW']

    def parse(self, response):
        xpath='//tbody/tr/td/div/div[position()=2]/text()'
        currency=response.xpath(xpath).getall()
        currency=[c.strip() for c in currency]
        xpath='//tbody/tr/td[position()=3]/text()'
        rate=response.xpath(xpath).getall()
        rate_dict={c: r for c, r in zip(currency, rate)}
        for c, r in rate_dict.items():    # iterate over the pairs
            rate_item=RateItem()          # create an item object
            rate_item['currency']=c       # store the currency
            rate_item['rate']=r           # store the rate
            yield rate_item               # hand the item back to Scrapy

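Take care that the import path for the RateItem class in items.py is correct. The spider lives in the spiders directory, so a relative import has to go up one level with .. to reach items.py (from ..items import RateItem); alternatively, name the second-level project package project3 explicitly, as done here.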

Running the spider gives the following result:

scrapy crawl bot_rate_spider -o data.json

D:\python\test\scrapy_projects\project3>scrapy crawl bot_rate_spider -o data.json

2024-07-19 11:20:26 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project3)

2024-07-19 11:20:26 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0

2024-07-19 11:20:26 [scrapy.addons] INFO: Enabled addons:

[]

2024-07-19 11:20:26 [asyncio] DEBUG: Using selector: SelectSelector

2024-07-19 11:20:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor

2024-07-19 11:20:26 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop

2024-07-19 11:20:26 [scrapy.extensions.telnet] INFO: Telnet Password: ba80414fe81344b3

2024-07-19 11:20:26 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats',

'scrapy.extensions.telnet.TelnetConsole',

'scrapy.extensions.feedexport.FeedExporter',

'scrapy.extensions.logstats.LogStats']

2024-07-19 11:20:26 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'project3',

'FEED_EXPORT_ENCODING': 'utf-8',

'NEWSPIDER_MODULE': 'project3.spiders',

'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',

'ROBOTSTXT_OBEY': True,

'SPIDER_MODULES': ['project3.spiders'],

'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}

2024-07-19 11:20:26 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',

'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

'scrapy.downloadermiddlewares.retry.RetryMiddleware',

'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',

'scrapy.downloadermiddlewares.stats.DownloaderStats']

2024-07-19 11:20:26 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

'scrapy.spidermiddlewares.referer.RefererMiddleware',

'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

'scrapy.spidermiddlewares.depth.DepthMiddleware']

2024-07-19 11:20:26 [scrapy.middleware] INFO: Enabled item pipelines:

[]

2024-07-19 11:20:26 [scrapy.core.engine] INFO: Spider opened

2024-07-19 11:20:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2024-07-19 11:20:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2024-07-19 11:20:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)

2024-07-19 11:20:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)