小狐狸事務所: Python Study Notes: The Scrapy Web Crawling Framework (5)

Friday, July 19, 2024

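In the previous post we used Scrapy's Item and Field classes to define a structured item class for the scraped data. That only wraps the target data in an object, however. Further processing of individual fields, such as stripping non-English characters, converting amounts, rounding to a given number of digits, or saving the data to a database, is not done in the spider itself; it is delegated to an Item Pipeline. This post tests item pipelines for data cleaning and for saving to a database.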

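Earlier notes in this series:

# Python Study Notes: Browser Automation Testing with the Selenium Module (1)

# Python Study Notes: Browser Automation Testing with the Selenium Module (2)

# Python Study Notes: Web Scraping (1) Using urllib and HTMLParser

# Python Study Notes: Web Scraping (2) Downloading Web Pages with the requests Package

# Python Study Notes: Web Scraping (3): Parsing Web Pages with BeautifulSoup

# Python Study Notes: Web Scraping (4): Steps and Tools for Developing a Web Crawler

# Python Study Notes: Web Scraping (5): Installing the Chrome Extension Quick Javascript Switcher

# Python Study Notes: Web Crawler in Practice (1) Bank of Taiwan Posted Exchange Rates

# Python Study Notes: Web Crawler in Practice (2) BBC Financial News

# Python Study Notes: Web Crawler in Practice (3) TWSE Market Holiday Dates

# Python Study Notes: Web Crawler in Practice (4) Taipei City Open Data Platform API

# Python Study Notes: Web Crawler in Practice (5) Fetching Weather Data from OpenWeather

# Python Study Notes: Web Crawler in Practice (6) 博客來 Bookstore's Book of the Day (34% Off) Page

# Python Study Notes: Web Crawler in Practice (7) List of TWSE/TPEx Listed Companies Page

# Python Study Notes: Web Crawler in Practice (8) Taiwan Stock Daily Post-Market Information Page

# Python Study Notes: Web Crawler in Practice (9) Municipal Library Personal Account Loan Information (Part 1)

# Python Study Notes: Web Crawler in Practice (9) Municipal Library Personal Account Loan Information (Part 2)

# Python Study Notes: Web Crawler in Practice (9) Municipal Library Personal Account Loan Information (Part 3)

# Python Study Notes: Web Crawler in Practice (10) NKUST Library Crawler

# Python Study Notes: Selenium 4 Usage (Part 1)

# Python Study Notes: Selenium 4 Usage (Part 2)

# Python Study Notes: Selenium 4 Usage (Part 3)

# Python Study Notes: Web Crawler in Practice (11) TDCC Shareholding Distribution Table

# Python Study Notes: Web Crawler in Practice (12) National Development Council Business Cycle Monitoring Indicator

# Python Study Notes: Web Crawler in Practice (13) FTSE China A50 Futures Index

# Python Study Notes: Web Crawler in Practice (14) Paginated Search Results on 104 Job Bank

# Python Study Notes: Web Crawler in Practice (15) Paginated NBA Player Data

# Python Study Notes: Web Crawler in Practice (16) Paginated Book Data on books.toscrape.com

# Python Study Notes: The Scrapy Web Crawling Framework (1)

# Python Study Notes: The Scrapy Web Crawling Framework (2)

# Python Study Notes: The Scrapy Web Crawling Framework (3)

# Python Study Notes: The Scrapy Web Crawling Framework (4)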

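10. Processing items with an Item Pipeline:

In the previous post, the spider's parse() method only stored the scraped data in item objects (that is, it just collected the data). To clean or otherwise process the field values of an Item object, use an Item Pipeline, which keeps that work in a separate module.

In Scrapy, the Item Pipeline is the component responsible for processing items. Concretely, it lives in the pipelines.py file in the second-level project directory, which is generated automatically when a new project is created. The Item Pipeline is optional, though: even after writing item-processing code in pipelines.py, you still have to register the pipeline class and its execution order in settings.py before it takes effect.

1. An Item Pipeline for data cleaning:

This post again uses the 19 currency exchange rates on the Bank of Taiwan posted-rates site as the example. First create a project named project4: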

scrapy startproject project4

D:\python\test\scrapy_projects>scrapy startproject project4

New Scrapy project 'project4', using template directory 'C:\Users\tony1\AppData\Local\Programs\Thonny\Lib\site-packages\scrapy\templates\project', created in:

D:\python\test\scrapy_projects\project4

You can start your first spider with:

cd project4

scrapy genspider example example.com

Then copy items.py from the project3 project into project4, overwriting the default items.py; the contents are unchanged:

# items.py

import scrapy

class RateItem(scrapy.Item):
    currency=scrapy.Field()
    rate=scrapy.Field()

Next, copy the spider bot_rate_spider.py from the spiders directory of project3 into the spiders directory of project4, and change it to import the RateItem class from project4.items. Nothing else changes; the contents are:

# bot_rate_spider.py

import scrapy
from project4.items import RateItem    # must be changed to project4 here

class RateSpider(scrapy.Spider):
    name='bot_rate_spider'
    allowed_domains=['rate.bot.com.tw']
    start_urls=['https://rate.bot.com.tw/xrt?Lang=zh-TW']

    def parse(self, response):
        xpath='//tbody/tr/td/div/div[position()=2]/text()'
        currency=response.xpath(xpath).getall()
        currency=[c.strip() for c in currency]
        xpath='//tbody/tr/td[position()=3]/text()'
        rate=response.xpath(xpath).getall()
        rate_dict={c: r for c, r in zip(currency, rate)}
        for c, r in rate_dict.items():
            rate_item=RateItem()
            rate_item['currency']=c
            rate_item['rate']=r
            yield rate_item

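This effectively copies project3 into project4, and the two behave identically. This post adds Item Pipeline functionality on top of that base.

First open pipelines.py in the second-level project directory; its contents were generated automatically when the project was created: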

# pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class Project4Pipeline:
    def process_item(self, item, spider):
        return item

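As shown, a class named Project4Pipeline is already defined by default; the class name can be changed freely. It contains a process_item() method, which is the function that processes items. It receives three parameters: the object itself (self), the item (each Item object yielded by the spider), and the running spider instance (spider), which can be used to perform different operations or checks depending on the spider.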
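The call contract of process_item() can be tried without running a crawl. In this minimal sketch, FakeSpider and StripCurrencyPipeline are hypothetical stand-ins (neither is part of Scrapy or of this project), and a plain dict plays the role of the item, which Scrapy also accepts as an item type.

```python
class FakeSpider:
    """Hypothetical stand-in for the running spider; only .name is used."""
    name = 'bot_rate_spider'

class StripCurrencyPipeline:
    """Hypothetical pipeline: strips whitespace, but only for one spider."""
    def process_item(self, item, spider):
        # item: one object yielded by the spider; spider: the running spider
        if spider.name == 'bot_rate_spider':
            item['currency'] = item['currency'].strip()
        return item    # the returned item is passed on to the next pipeline

item = {'currency': '  美金 (USD) ', 'rate': '33'}    # sample input
result = StripCurrencyPipeline().process_item(item, FakeSpider())
print(result)
```

The method modifies the item in place and returns the same object, which is the pattern the pipelines in this post follow as well.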

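Note that pipelines.py imports the itemadapter.ItemAdapter class by default. This adapter class wraps an Item object in a dict-like ItemAdapter object so that the item's contents can be handled the same way as a Python dictionary, for example: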

>>> import scrapy
>>> from itemadapter import ItemAdapter
>>> class MyItem(scrapy.Item):
        name=scrapy.Field()
        price=scrapy.Field()

>>> item=MyItem(name='Apple', price=10)
>>> adapter=ItemAdapter(item)
>>> type(adapter)
<class 'itemadapter.adapter.ItemAdapter'>
>>> dir(adapter)

['ADAPTER_CLASSES', '_MutableMapping__marker', '__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_asdict', '_get_adapter_class', 'adapter', 'asdict', 'clear', 'field_names', 'get', 'get_field_meta', 'get_field_meta_from_class', 'get_field_names_from_class', 'is_item', 'is_item_class', 'item', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

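As the listing shows, ItemAdapter provides the familiar dictionary-style methods such as keys(), values(), items(), get(), pop(), and update():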

>>> adapter.keys()

KeysView(<ItemAdapter for MyItem(name='Apple', price=10)>)

>>> adapter.values()

ValuesView(<ItemAdapter for MyItem(name='Apple', price=10)>)

>>> adapter.items()

ItemsView(<ItemAdapter for MyItem(name='Apple', price=10)>)

The itemadapter package is installed together with Scrapy, so it can be imported directly. See:

# https://pypi.org/project/itemadapter/0.0.3/

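The first test in this post cleans the currency field of every item by removing the parentheses around the English currency code, so the class can be renamed RemoveParentheses. The cleaning is implemented in process_item(): chaining two calls to the string replace() method removes the left and right parentheses, and the result is written back to the item's currency field. pipelines.py is modified as follows:

# pipelines.py

from itemadapter import ItemAdapter

class RemoveParentheses:    # user-defined class name
    def process_item(self, item, spider):
        currency=item['currency'].replace('(', '').replace(')', '')    # remove parentheses
        item['currency']=currency    # update the currency field
        return item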
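The replace() chain can be tried on its own before it is wired into Scrapy. The sample strings below are assumed inputs in the form the spider produces before cleaning (a name followed by a code in parentheses). The remove_parentheses() helper is only an illustrative wrapper, and str.translate() is shown purely as an alternative that deletes both characters in one pass.

```python
def remove_parentheses(text):
    # same chained replace() calls as in RemoveParentheses.process_item()
    return text.replace('(', '').replace(')', '')

# alternative: delete both characters in a single pass with a translation table
PAREN_TABLE = str.maketrans('', '', '()')

samples = ['美金 (USD)', '日圓 (JPY)']    # assumed sample inputs
for s in samples:
    print(remove_parentheses(s), '|', s.translate(PAREN_TABLE))
```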

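Note that a project can define several Item Pipeline classes to perform different kinds of item processing, but a pipeline class only takes effect once it is registered in the ITEM_PIPELINES dictionary in settings.py. That means adding an entry whose key is the class's import path and whose value is its execution order; only then is the pipeline class in pipelines.py executed.

When a new project is created, the Item Pipeline feature is disabled by default. Open settings.py and search for 'Pipeline' to find the commented-out default ITEM_PIPELINES setting: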

# Configure item pipelines

# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

#ITEM_PIPELINES = {
#    "project4.pipelines.Project4Pipeline": 300,
#}

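Remove the # characters in front of the ITEM_PIPELINES dictionary, change the class name to the RemoveParentheses class defined in pipelines.py, and set its execution order to enable the Item Pipeline, as shown below: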

ITEM_PIPELINES = {

"project4.pipelines.RemoveParentheses": 300,

}

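The value after the colon is by convention in the 0-1000 range. When pipelines.py defines several item-processing classes, ITEM_PIPELINES needs one key-value pair for each of them, and the order in which they run is determined by these values: classes with smaller values run first.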
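To illustrate how the order values work, the sketch below mimics the ordering rule in plain Python. It is not Scrapy's actual implementation: RateToFloat is a hypothetical second pipeline, the PIPELINES dict maps classes (rather than import-path strings) to order values, and run_pipelines() is an illustrative helper.

```python
class RemoveParentheses:
    def process_item(self, item, spider):
        item['currency'] = item['currency'].replace('(', '').replace(')', '')
        return item

class RateToFloat:
    """Hypothetical pipeline: converts the rate string to float, or None for '-'."""
    def process_item(self, item, spider):
        try:
            item['rate'] = float(item['rate'])
        except ValueError:
            item['rate'] = None    # '-' means no rate is quoted
        return item

# class -> order value; listed out of order on purpose
PIPELINES = {RateToFloat: 500, RemoveParentheses: 300}

def run_pipelines(item, spider=None):
    executed = []
    # lower order values run first
    for cls, _order in sorted(PIPELINES.items(), key=lambda kv: kv[1]):
        item = cls().process_item(item, spider)
        executed.append(cls.__name__)
    return item, executed

hkd, order = run_pipelines({'currency': '港幣 (HKD)', 'rate': '4.239'})
zar, _ = run_pipelines({'currency': '南非幣 (ZAR)', 'rate': '-'})
print(order)
print(hkd)
print(zar)
```

Although RateToFloat is listed first in the dict, its larger order value means it runs after RemoveParentheses.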

Now the spider can be run:

scrapy crawl bot_rate_spider -o data.json

D:\python\test\scrapy_projects\project4>scrapy crawl bot_rate_spider -o data.json

2024-07-19 19:46:35 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project4)

2024-07-19 19:46:35 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0

2024-07-19 19:46:35 [scrapy.addons] INFO: Enabled addons:

[]

2024-07-19 19:46:35 [asyncio] DEBUG: Using selector: SelectSelector

2024-07-19 19:46:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor

2024-07-19 19:46:35 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop

2024-07-19 19:46:35 [scrapy.extensions.telnet] INFO: Telnet Password: 4e7f47f963c14a9a

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats',

'scrapy.extensions.telnet.TelnetConsole',

'scrapy.extensions.feedexport.FeedExporter',

'scrapy.extensions.logstats.LogStats']

2024-07-19 19:46:35 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'project4',

'FEED_EXPORT_ENCODING': 'utf-8',

'NEWSPIDER_MODULE': 'project4.spiders',

'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',

'ROBOTSTXT_OBEY': True,

'SPIDER_MODULES': ['project4.spiders'],

'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',

'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

'scrapy.downloadermiddlewares.retry.RetryMiddleware',

'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',

'scrapy.downloadermiddlewares.stats.DownloaderStats']

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

'scrapy.spidermiddlewares.referer.RefererMiddleware',

'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

'scrapy.spidermiddlewares.depth.DepthMiddleware']

2024-07-19 19:46:35 [scrapy.middleware] INFO: Enabled item pipelines:

['project4.pipelines.RemoveParentheses']

2024-07-19 19:46:35 [scrapy.core.engine] INFO: Spider opened

2024-07-19 19:46:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2024-07-19 19:46:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2024-07-19 19:46:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)

2024-07-19 19:46:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '美金 USD', 'rate': '33'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '港幣 HKD', 'rate': '4.239'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '英鎊 GBP', 'rate': '43.17'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '澳幣 AUD', 'rate': '22.3'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '加拿大幣 CAD', 'rate': '24.29'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '新加坡幣 SGD', 'rate': '24.68'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '瑞士法郎 CHF', 'rate': '37.21'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '日圓 JPY', 'rate': '0.2115'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '南非幣 ZAR', 'rate': '-'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '瑞典幣 SEK', 'rate': '3.21'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '紐元 NZD', 'rate': '20.08'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '泰幣 THB', 'rate': '0.9636'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '菲國比索 PHP', 'rate': '0.6271'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '印尼幣 IDR', 'rate': '0.00238'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '歐元 EUR', 'rate': '36.15'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '韓元 KRW', 'rate': '0.02574'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '越南盾 VND', 'rate': '0.00146'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '馬來幣 MYR', 'rate': '7.482'}

2024-07-19 19:46:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>

{'currency': '人民幣 CNY', 'rate': '4.562'}

2024-07-19 19:46:35 [scrapy.core.engine] INFO: Closing spider (finished)

2024-07-19 19:46:35 [scrapy.extensions.feedexport] INFO: Stored json feed (19 items) in: data.json

2024-07-19 19:46:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 745,

'downloader/request_count': 2,

'downloader/request_method_count/GET': 2,

'downloader/response_bytes': 137734,

'downloader/response_count': 2,

'downloader/response_status_count/200': 2,

'elapsed_time_seconds': 0.39827,

'feedexport/success_count/FileFeedStorage': 1,

'finish_reason': 'finished',

'finish_time': datetime.datetime(2024, 7, 19, 11, 46, 35, 833850, tzinfo=datetime.timezone.utc),

'item_scraped_count': 19,

'log_count/DEBUG': 24,

'log_count/INFO': 11,

'response_received_count': 2,

'robotstxt/request_count': 1,

'robotstxt/response_count': 1,

'robotstxt/response_status_count/200': 1,

'scheduler/dequeued': 1,

'scheduler/dequeued/memory': 1,

'scheduler/enqueued': 1,

'scheduler/enqueued/memory': 1,

'start_time': datetime.datetime(2024, 7, 19, 11, 46, 35, 435580, tzinfo=datetime.timezone.utc)}

2024-07-19 19:46:35 [scrapy.core.engine] INFO: Spider closed (finished)

Opening the output file data.json shows the following:

[

{"currency": "美金 USD", "rate": "33"},

{"currency": "港幣 HKD", "rate": "4.239"},

{"currency": "英鎊 GBP", "rate": "43.17"},

{"currency": "澳幣 AUD", "rate": "22.3"},

{"currency": "加拿大幣 CAD", "rate": "24.28"},

{"currency": "新加坡幣 SGD", "rate": "24.68"},

{"currency": "瑞士法郎 CHF", "rate": "37.21"},

{"currency": "日圓 JPY", "rate": "0.2115"},

{"currency": "南非幣 ZAR", "rate": "-"},

{"currency": "瑞典幣 SEK", "rate": "3.2"},

{"currency": "紐元 NZD", "rate": "20.08"},

{"currency": "泰幣 THB", "rate": "0.9631"},

{"currency": "菲國比索 PHP", "rate": "0.6271"},

{"currency": "印尼幣 IDR", "rate": "0.00238"},

{"currency": "歐元 EUR", "rate": "36.15"},

{"currency": "韓元 KRW", "rate": "0.02573"},

{"currency": "越南盾 VND", "rate": "0.00146"},

{"currency": "馬來幣 MYR", "rate": "7.482"},

{"currency": "人民幣 CNY", "rate": "4.56"}

]

The parentheses in the currency field have been removed by the Item Pipeline.

The zipped project for the test above can be downloaded from GitHub:

# https://github.com/tony1966/tony1966.github.io/blob/master/test/python/web_crawler/scrapy_project4_1.zip

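2. An Item Pipeline that writes to a database (approach 1):

Building on the project4 project above, we now want to save each item to a SQLite database after the data-cleaning pipeline has removed the parentheses from the currency field. This requires adding another class to pipelines.py. In addition to process_item(), it needs open_spider() and close_spider() methods, with the following roles:

open_spider() : connect to the database and create the table
process_item() : process the item and execute the SQL statement
close_spider() : commit the database transaction and close the connection

Working with SQLite requires importing Python's built-in sqlite3 module. SQLite is a lightweight relational database that, like Microsoft Access, stores the whole database in a single file, which makes backup and transfer very easy, and it runs without a server connection. For usage see:

# Python Study Notes: Database Access Tests (1) SQLite

Modify pipelines.py in the second-level project4 directory: first import the sqlite3 module, then add a new pipeline class Save2SQLite that defines the three methods above: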

# pipelines.py

from itemadapter import ItemAdapter
import sqlite3

class RemoveParentheses:
    def process_item(self, item, spider):
        currency=item['currency'].replace('(', '').replace(')', '')    # remove left and right parentheses
        item['currency']=currency    # update the currency field
        return item    # return the item

class Save2SQLite:
    def open_spider(self, spider):
        self.conn=sqlite3.connect('bot_rate_spider.sqlite')    # connect to the database file
        self.cur=self.conn.cursor()    # create a Cursor object
        SQL='CREATE TABLE if not exists bot_rate(' +\
            'currency TEXT, rate TEXT)'    # create table bot_rate if it does not exist
        self.cur.execute(SQL)    # create the table

    def process_item(self, item, spider):
        SQL='INSERT INTO bot_rate(currency, rate) VALUES("' +\
            item['currency'] + '","' + item['rate'] + '")'    # SQL INSERT statement
        self.cur.execute(SQL)    # insert the record
        return item    # return the item

    def close_spider(self, spider):
        self.conn.commit()    # commit the transaction
        self.conn.close()    # close the database connection

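Note that because the currency and rate columns are both of type TEXT, the values must be wrapped in quotes in the SQL statement; otherwise an error occurs at run time. Also, open_spider() names the database file bot_rate_spider.sqlite, but there is no need to create an empty database file by hand beforehand: sqlite3.connect() creates it in the current working directory, which is the first-level project4 directory when the crawl is started from there.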
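Because the INSERT statement in Save2SQLite is assembled by string concatenation, a value that itself contains a double quote would break the SQL. As a hedged alternative (not the approach used in this post), sqlite3 also accepts ? placeholders and binds the values itself. The sketch below uses an in-memory database and a hypothetical save_item() helper so that it can run outside Scrapy.

```python
import sqlite3

conn = sqlite3.connect(':memory:')    # in-memory database, nothing written to disk
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS bot_rate(currency TEXT, rate TEXT)')

def save_item(cur, item):
    # ? placeholders let sqlite3 bind the values, so no manual quoting is needed
    cur.execute('INSERT INTO bot_rate(currency, rate) VALUES(?, ?)',
                (item['currency'], item['rate']))

save_item(cur, {'currency': '美金 USD', 'rate': '33'})
save_item(cur, {'currency': 'with "quote"', 'rate': '-'})    # made-up edge case
conn.commit()

rows = cur.execute(
    'SELECT currency, rate FROM bot_rate ORDER BY rowid').fetchall()
print(rows)
conn.close()
```

The same two-argument execute() call could replace the concatenated statement inside process_item() if values with quotes ever need to be stored.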

Next, edit the ITEM_PIPELINES dictionary in settings.py and add a key-value pair made up of the new Save2SQLite pipeline class and its order value:

# settings.py

# Configure item pipelines

# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {
    "project4.pipelines.RemoveParentheses": 300,    # runs first
    "project4.pipelines.Save2SQLite": 400,    # runs second
}

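By convention the order values fall in the 0-1000 range. Save2SQLite has a larger value than RemoveParentheses, so it runs later: RemoveParentheses strips the parentheses first, and only then is the item saved to the table.

The run produces the following output: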

scrapy crawl bot_rate_spider -o data.json

D:\python\test\scrapy_projects\project4>scrapy crawl bot_rate_spider -o data.json

2024-07-20 10:21:37 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project4)

2024-07-20 10:21:37 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0

2024-07-20 10:21:37 [scrapy.addons] INFO: Enabled addons:

[]

2024-07-20 10:21:37 [asyncio] DEBUG: Using selector: SelectSelector

2024-07-20 10:21:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor

2024-07-20 10:21:37 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop

2024-07-20 10:21:37 [scrapy.extensions.telnet] INFO: Telnet Password: 866130966a0be4e8

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats',

'scrapy.extensions.telnet.TelnetConsole',

'scrapy.extensions.feedexport.FeedExporter',

'scrapy.extensions.logstats.LogStats']

2024-07-20 10:21:37 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'project4',

'FEED_EXPORT_ENCODING': 'utf-8',

'NEWSPIDER_MODULE': 'project4.spiders',

'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',

'ROBOTSTXT_OBEY': True,

'SPIDER_MODULES': ['project4.spiders'],

'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',

'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

'scrapy.downloadermiddlewares.retry.RetryMiddleware',

'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',

'scrapy.downloadermiddlewares.stats.DownloaderStats']

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

'scrapy.spidermiddlewares.referer.RefererMiddleware',

'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

'scrapy.spidermiddlewares.depth.DepthMiddleware']

2024-07-20 10:21:37 [scrapy.middleware] INFO: Enabled item pipelines:

['project4.pipelines.RemoveParentheses', 'project4.pipelines.Save2SQLite']

2024-07-20 10:21:37 [scrapy.core.engine] INFO: Spider opened

2024-07-20 10:21:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2024-07-20 10:21:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2024-07-20 10:21:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)

2024-07-20 10:21:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)