Tuesday, July 2, 2024

Python Learning Notes: The Scrapy Web Crawling Framework (Part 1)

Scrapy is an open-source web crawling framework written in Python and built on the Twisted framework. It was originally developed by staff of the e-commerce company Mydeco, was later taken over by Scrapinghub, and is currently maintained by the cloud company Zyte together with volunteer contributors. Scrapy is cross-platform, easy to learn, and backed by an active community; using it can greatly reduce the cost of developing crawlers and make them more robust. Previously we built crawlers by hand with requests and BeautifulSoup, whereas Scrapy gives developers a systematic toolkit that speeds up crawler development.

Scrapy's advantages can be summarized as follows:
  • It can scrape static HTML pages as well as data returned by APIs.
  • It integrates the three main crawler tasks, fetching, parsing, and storing the results, into one workflow.
  • It defines a complete crawling workflow and set of modules, letting developers finish a crawler quickly.
  • The underlying Twisted framework enables asynchronous crawling, improving crawler efficiency.
References:


However, Scrapy itself can only fetch static page content. Because Scrapy cannot execute JavaScript, it cannot, unlike Selenium, scrape pages generated by JavaScript; it can, however, be paired with the Splash rendering engine for that purpose, see the scrapy-splash integration package:


Online tutorials:


Reference books:

# 精通Python爬虫框架Scrapy (人民郵電, 2018)
# Python 網路爬蟲實戰 (松崗, 2017), Chapter 5
# Python 網路爬蟲與資料分析入門實戰 (博碩, 2018), Appendix

Earlier notes in this series:



1. Installing Scrapy:

Since I write all my Python in Thonny, I simply open Thonny's "Tools / Open system shell" and install it with pip:

pip install scrapy

D:\python\test>pip install scrapy   
Collecting scrapy
  Downloading Scrapy-2.11.2-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading twisted-24.3.0-py3-none-any.whl.metadata (9.5 kB)
Requirement already satisfied: cryptography>=36.0.0 in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from scrapy) (38.0.4)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.9.1-py2.py3-none-any.whl.metadata (11 kB)
Collecting pyOpenSSL>=21.0.0 (from scrapy)
  Downloading pyOpenSSL-24.1.0-py3-none-any.whl.metadata (12 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queuelib-1.7.0-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading service_identity-24.1.0-py3-none-any.whl.metadata (4.8 kB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting zope.interface>=5.1.0 (from scrapy)
  Downloading zope.interface-6.4.post2-cp310-cp310-win_amd64.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.1/44.1 kB 364.1 kB/s eta 0:00:00
Collecting protego>=0.1.15 (from scrapy)
  Downloading Protego-0.3.1-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting itemadapter>=0.1.0 (from scrapy)
  Downloading itemadapter-0.9.0-py3-none-any.whl.metadata (17 kB)
Requirement already satisfied: setuptools in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from scrapy) (65.5.0)
Requirement already satisfied: packaging in c:\users\tony1\appdata\roaming\python\python310\site-packages (from scrapy) (23.1)
Collecting tldextract (from scrapy)
  Downloading tldextract-5.1.2-py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: lxml>=4.4.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from scrapy) (4.9.3)
Requirement already satisfied: defusedxml>=0.7.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from scrapy) (0.7.1)
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading PyDispatcher-2.0.7-py3-none-any.whl.metadata (2.4 kB)
Requirement already satisfied: cffi>=1.12 in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from cryptography>=36.0.0->scrapy) (1.15.1)
Collecting jmespath>=0.9.5 (from itemloaders>=1.0.1->scrapy)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting cryptography>=36.0.0 (from scrapy)
  Downloading cryptography-42.0.8-cp39-abi3-win_amd64.whl.metadata (5.4 kB)
Requirement already satisfied: attrs>=19.1.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from service-identity>=18.1.0->scrapy) (23.1.0)
Requirement already satisfied: pyasn1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from service-identity>=18.1.0->scrapy) (0.5.0)
Requirement already satisfied: pyasn1-modules in c:\users\tony1\appdata\roaming\python\python310\site-packages (from service-identity>=18.1.0->scrapy) (0.3.0)
Collecting automat>=0.8.0 (from Twisted>=18.9.0->scrapy)
  Downloading Automat-22.10.0-py2.py3-none-any.whl.metadata (1.0 kB)
Collecting constantly>=15.1 (from Twisted>=18.9.0->scrapy)
  Downloading constantly-23.10.4-py3-none-any.whl.metadata (1.8 kB)
Collecting hyperlink>=17.1.1 (from Twisted>=18.9.0->scrapy)
  Downloading hyperlink-21.0.0-py2.py3-none-any.whl.metadata (1.5 kB)
Collecting incremental>=22.10.0 (from Twisted>=18.9.0->scrapy)
  Downloading incremental-22.10.0-py2.py3-none-any.whl.metadata (6.0 kB)
Collecting twisted-iocpsupport<2,>=1.0.2 (from Twisted>=18.9.0->scrapy)
  Downloading twisted_iocpsupport-1.0.4-cp310-cp310-win_amd64.whl.metadata (2.2 kB)
Requirement already satisfied: typing-extensions>=4.2.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from Twisted>=18.9.0->scrapy) (4.9.0)
Requirement already satisfied: idna in c:\users\tony1\appdata\roaming\python\python310\site-packages (from tldextract->scrapy) (3.4)
Requirement already satisfied: requests>=2.1.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from tldextract->scrapy) (2.31.0)
Collecting requests-file>=1.4 (from tldextract->scrapy)
  Downloading requests_file-2.1.0-py2.py3-none-any.whl.metadata (1.7 kB)
Requirement already satisfied: filelock>=3.0.8 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from tldextract->scrapy) (3.12.3)
Requirement already satisfied: six in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from automat>=0.8.0->Twisted>=18.9.0->scrapy) (1.16.0)
Requirement already satisfied: pycparser in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from cffi>=1.12->cryptography>=36.0.0->scrapy) (2.21)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.1.0->tldextract->scrapy) (3.2.0)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.1.0->tldextract->scrapy) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.1.0->tldextract->scrapy) (2023.7.22)
Downloading Scrapy-2.11.2-py2.py3-none-any.whl (290 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 290.1/290.1 kB 1.4 MB/s eta 0:00:00
Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Downloading itemadapter-0.9.0-py3-none-any.whl (11 kB)
Downloading itemloaders-1.3.1-py3-none-any.whl (12 kB)
Downloading parsel-1.9.1-py2.py3-none-any.whl (17 kB)
Downloading Protego-0.3.1-py2.py3-none-any.whl (8.5 kB)
Downloading PyDispatcher-2.0.7-py3-none-any.whl (12 kB)
Downloading pyOpenSSL-24.1.0-py3-none-any.whl (56 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 kB 3.1 MB/s eta 0:00:00
Downloading cryptography-42.0.8-cp39-abi3-win_amd64.whl (2.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.9/2.9 MB 7.7 MB/s eta 0:00:00
Downloading queuelib-1.7.0-py2.py3-none-any.whl (13 kB)
Downloading service_identity-24.1.0-py3-none-any.whl (12 kB)
Downloading twisted-24.3.0-py3-none-any.whl (3.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 20.2 MB/s eta 0:00:00
Downloading w3lib-2.2.1-py3-none-any.whl (21 kB)
Downloading zope.interface-6.4.post2-cp310-cp310-win_amd64.whl (206 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.4/206.4 kB 13.1 MB/s eta 0:00:00
Downloading tldextract-5.1.2-py3-none-any.whl (97 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.6/97.6 kB ? eta 0:00:00
Downloading Automat-22.10.0-py2.py3-none-any.whl (26 kB)
Downloading constantly-23.10.4-py3-none-any.whl (13 kB)
Downloading hyperlink-21.0.0-py2.py3-none-any.whl (74 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.6/74.6 kB 4.0 MB/s eta 0:00:00
Downloading incremental-22.10.0-py2.py3-none-any.whl (16 kB)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading requests_file-2.1.0-py2.py3-none-any.whl (4.2 kB)
Downloading twisted_iocpsupport-1.0.4-cp310-cp310-win_amd64.whl (46 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.8/46.8 kB 2.3 MB/s eta 0:00:00
Installing collected packages: twisted-iocpsupport, PyDispatcher, incremental, zope.interface, w3lib, queuelib, protego, jmespath, itemadapter, hyperlink, cssselect, constantly, automat, Twisted, requests-file, parsel, cryptography, tldextract, service-identity, pyOpenSSL, itemloaders, scrapy
  Attempting uninstall: cryptography
    Found existing installation: cryptography 38.0.4
    Uninstalling cryptography-38.0.4:
      Successfully uninstalled cryptography-38.0.4
Successfully installed PyDispatcher-2.0.7 Twisted-24.3.0 automat-22.10.0 constantly-23.10.4 cryptography-42.0.8 cssselect-1.2.0 hyperlink-21.0.0 incremental-22.10.0 itemadapter-0.9.0 itemloaders-1.3.1 jmespath-1.0.1 parsel-1.9.1 protego-0.3.1 pyOpenSSL-24.1.0 queuelib-1.7.0 requests-file-2.1.0 scrapy-2.11.2 service-identity-24.1.0 tldextract-5.1.2 twisted-iocpsupport-1.0.4 w3lib-2.2.1 zope.interface-6.4.post2

Check the installed version:

>>> import scrapy    
>>> scrapy.__version__    
'2.11.2'
>>> scrapy.version_info   
(2, 11, 2)

Scrapy ships with a command-line tool, scrapy.exe, which is placed under the Python installation directory when Scrapy is installed. I use Thonny's bundled Python environment, whose installation directory is C:\Users\tony1\AppData\Local\Programs\Thonny, so scrapy.exe ends up in its Scripts subdirectory:




Running scrapy in a Command Prompt window without any arguments prints its usage:

D:\python\test>scrapy   
Scrapy 2.11.2 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Its usage is similar to Django: the scrapy command-line tool is used to create projects and run spiders.

For the usage of the Scrapy shell command, see:
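The shell is handy for testing selectors before writing a spider. A quick illustrative session might look like the following (commands only; the selectors are my own assumptions and no output is reproduced here):

D:\python\test>scrapy shell "https://rate.bot.com.tw/xrt?Lang=zh-TW"
... (startup log omitted)
>>> response.status                       # HTTP status code of the fetched page
>>> response.css('title::text').get()     # try a CSS selector on the live response
>>> fetch('https://rate.bot.com.tw')      # download another URL into the shell
>>> view(response)                        # open the downloaded page in a browser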



2. Scrapy's architecture and workflow:

Scrapy's architecture consists of six major components:


  • Engine: the core of the system; controls the data flow between components and triggers events.
  • Scheduler: receives requests from the Engine and holds them in a queue until they are dispatched.
  • Downloader: receives requests from the Engine, downloads the pages, and returns the responses to the Spider via the Engine.
  • Spider: sends requests for the target pages to the Engine, parses the responses, and assembles the results into Items for the Engine.
  • Item Pipeline: receives the Items the Spider sends via the Engine, then cleans and stores them as structured data.
  • Middleware: sits between the Engine and the other components and is used to extend Scrapy's functionality.




(Scrapy architecture diagram omitted; source: GitHub)


Scrapy's crawling workflow:
  1. The Spider sends requests for the pages to be fetched to the Engine, which forwards them to the Scheduler for scheduling.
  2. The Scheduler feeds requests back to the Engine according to its schedule, and the Engine passes each request to the Downloader for downloading.
  3. The Downloader sends the downloaded response back through the Engine to the Spider for parsing; the parsed results are assembled into Items and sent through the Engine to the Item Pipelines for cleaning and storage.
  4. The Engine signals the Scheduler that the next request may proceed; steps 2-3 repeat until the Scheduler's queue is empty.
The Item Pipelines component is actually split across two files, items.py and pipelines.py: items.py declares which data fields (items) are to be processed, while pipelines.py cleans and stores the data (for example, writing it to a JSON file or a database).
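As a minimal sketch of those two files (the field names and the class names RateItem and Project1Pipeline are my own assumptions, not part of the example project), they might look like this:

# items.py : declare the fields to be collected
import scrapy

class RateItem(scrapy.Item):
    currency = scrapy.Field()   # currency name
    rate = scrapy.Field()       # exchange rate

# pipelines.py : clean and store each item the spider yields
class Project1Pipeline:
    def process_item(self, item, spider):
        # cleaning / saving logic goes here; the item must be returned
        return item

A pipeline only takes effect after it is enabled through the ITEM_PIPELINES setting in settings.py (commented out by default, as shown in section 6).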

The following simple example shows how to build a Scrapy crawler.


3. Creating a Scrapy project:

In a Command Prompt window, the following command creates a Scrapy project in the current working directory:

scrapy startproject project_name   

where project_name is a project name of your choosing.

The Bank of Taiwan posted exchange-rate page is used below to show how to build a Scrapy crawler:


Although the content of this page changes daily, it is essentially a static page, not one generated by JavaScript; see the earlier test that scraped it with requests + BeautifulSoup:


First create a top-level directory, scrapy_projects, to hold Scrapy projects:

D:\python\test>mkdir scrapy_projects   
D:\python\test>cd scrapy_projects   

Then, inside this directory, create a project named project1 with scrapy startproject:

D:\python\test\scrapy_projects>scrapy startproject project1      
New Scrapy project 'project1', using template directory 'C:\Users\tony1\AppData\Local\Programs\Thonny\Lib\site-packages\scrapy\templates\project', created in:
    D:\python\test\scrapy_projects\project1

You can start your first spider with:
    cd project1
    scrapy genspider example example.com   

As with Django, this command creates two nested project1 directories under scrapy_projects. The outer one contains only the deployment configuration file scrapy.cfg:




The inner project1 directory contains the following files:




The project's directory structure can be listed as a tree with tree /f:

D:\python\test\scrapy_projects>tree /f    
列出磁碟區 新增磁碟區 的資料夾 PATH
磁碟區序號為 1258-16B8
D:.
└─project1
    │  scrapy.cfg
    │
    └─project1
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        └─spiders
                __init__.py


Inside the inner project1 directory, settings.py is the project's settings file; items.py defines the format used to store structured data; pipelines.py defines how scraped items are cleaned and stored; middlewares.py is where extensions are written; and the spider programs themselves live in the spiders directory.


4. Writing the spider:

We must create our own spider module under the spiders directory; for the Bank of Taiwan exchange-rate crawler it could be named bot_rate_spider.py. The spider imports the scrapy.spiders.Spider class and defines a subclass of it, for example RateSpider, with two attributes, name (the spider's name as a string) and start_urls (a list of URL strings to crawl), plus a parse() method. Note that parse() is a generator that returns values with yield rather than an ordinary function using return; this is where we fetch and parse the page to extract the target data, assemble it into a dictionary, and yield it to the Item component for processing.

Inside a Scrapy spider, pages can be parsed with either of the following two kinds of objects:
  • BeautifulSoup objects
  • Selector objects
BeautifulSoup is slower when it uses the built-in html.parser; installing the lxml package and switching to the lxml parser speeds it up. Selector objects are provided by Scrapy's own scrapy.selector.Selector class, which is built on lxml and therefore fast; Selectors locate elements with CSS selectors or XPath.
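For comparison, here is a minimal standalone example (with made-up HTML) of parsing with a Selector:

from scrapy.selector import Selector

html = '<html><body><h1>Hello</h1><p class="note">World</p></body></html>'
sel = Selector(text=html)
print(sel.css('h1::text').get())                      # 'Hello'  (CSS selector)
print(sel.xpath('//p[@class="note"]/text()').get())   # 'World'  (XPath)

Inside a spider, response.css() and response.xpath() offer the same interface directly on the downloaded response.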

This post uses BeautifulSoup as the parser; the spider's structure looks like this:

# myspider.py
from scrapy.spiders import Spider
from bs4 import BeautifulSoup

class MySpider(Spider):
    name='project_name'     # spider name (the author uses the project name here)
    start_urls=[url1, url2, url3, ...]     # starting URLs
    def parse(self, response):
        soup=BeautifulSoup(response.text, 'html.parser')  
        item1=soup.find(tag).text              # locate by tag name
        item2=soup.select_one(selector).text   # locate by CSS selector
        ...
        yield {
            'item1': item1,
            'item2': item2,
            ...
            }

Note that the spider class's name attribute can be anything you like; it is the name used when running the spider with scrapy crawl <spider name>. Also, this post first parses the target page with BeautifulSoup; later tests will switch to the xpath() and css() methods of Scrapy's own Selector objects as the parsing tool.

Taking the Bank of Taiwan exchange-rate crawler as the example, we add a spider bot_rate_spider.py under the spiders directory:

from scrapy.spiders import Spider
from bs4 import BeautifulSoup

class RateSpider(Spider):
    name='project1'
    start_urls=['https://rate.bot.com.tw/xrt?Lang=zh-TW']
    def parse(self, response):
        soup=BeautifulSoup(response.text, 'html.parser')
        currency=[]    # list of currency names
        rate=[]        # list of exchange rates
        table=soup.find('table', {'title': '牌告匯率'})
        for tr in table.find('tbody').find_all('tr'):
            tds=tr.find_all('td')
            c=tds[0].find('div', {'class': 'visible-phone'}).text.strip()
            r=tds[2].text
            currency.append(c)
            rate.append(r)
        result={c: r for c, r in zip(currency, rate)}   # zip() pairs corresponding elements of the two lists
        yield result

The logic for extracting the target data is ported straight from the hand-written crawler in the previous post. It produces the two lists currency and rate, holding the currency names and exchange rates (both as strings); zip() then pairs the corresponding elements of the two lists into a zip object, and a dictionary comprehension walks over that object to turn it into a dictionary, as the following example shows:

>>> currency=['美金 (USD)', '港幣 (HKD)', '英鎊 (GBP)']   
>>> rate=['32.28', '4.135', '41.19']   
>>> currency_rate={c: r for c, r in zip(currency, rate)}      
>>> currency_rate     
{'美金 (USD)': '32.28', '港幣 (HKD)': '4.135', '英鎊 (GBP)': '41.19'}   

Finally, yield passes this dictionary on to the Item component for processing.

The spider above was written entirely by hand; alternatively, Scrapy's genspider command can generate one. Its syntax is:

scrapy genspider example example.com  

This was already hinted at when the project was created with scrapy startproject project1. Here example is the file name of the spider, bot_rate_spider in our case, and example.com is the domain of the target site, here rate.bot.com.tw, so the spider can be created with:

scrapy genspider bot_rate_spider rate.bot.com.tw  

D:\python\test\scrapy_projects\project2>scrapy genspider bot_rate_spider rate.bot.com.tw   
Created spider 'bot_rate_spider' using template 'basic' in module:
  project2.spiders.bot_rate_spider   

Scrapy then creates a spider file bot_rate_spider.py under the spiders directory from the default basic template; opening it shows the following:

import scrapy

class BotRateSpiderSpider(scrapy.Spider):
    name = "bot_rate_spider"        # 可自訂 
    allowed_domains = ["rate.bot.com.tw"]  
    start_urls = ["https://rate.bot.com.tw"]   # 要改為 'https://rate.bot.com.tw/xrt?Lang=zh-TW'

    def parse(self, response):
        pass

The structure is much the same as the hand-written version (with an extra, optional allowed_domains attribute); the class name can be shortened to BotRateSpider or RateSpider if you prefer. The name attribute can be customized, and start_urls must be changed to the actual URL to crawl. All that remains is to implement the parse() callback to fetch the page, extract the target data into an item dictionary, and yield it to the Item Pipelines component, as in the sketch below.
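For example, parse() could be filled in with Scrapy's own Selector methods instead of BeautifulSoup. The following is only an untested sketch that assumes the same page structure the BeautifulSoup version relies on (a table titled 牌告匯率, currency names inside div.visible-phone, and the rate in the third td of each row):

    def parse(self, response):
        result = {}
        # iterate over the rows of the posted exchange-rate table with CSS selectors
        for tr in response.css('table[title="牌告匯率"] tbody tr'):
            currency = ''.join(tr.css('div.visible-phone::text').getall()).strip()
            rate = ''.join(tr.css('td')[2].css('::text').getall()).strip()
            result[currency] = rate
        yield result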


5. Running the spider:

Switch the Command Prompt to the outer project1 directory and run the spider with:

scrapy crawl <spider name>  

For the exchange-rate project above this is scrapy crawl project1. The run prints a lot of progress information, with the scraped result mixed in (it appears after the 'Scraped from <200 ...>' debug line):

D:\python\test\scrapy_projects\project1>scrapy crawl project1   
2024-07-02 14:40:03 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project1)
2024-07-02 14:40:03 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0
2024-07-02 14:40:03 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-02 14:40:03 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-02 14:40:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-02 14:40:03 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-02 14:40:03 [scrapy.extensions.telnet] INFO: Telnet Password: 153ff81e247632ac
2024-07-02 14:40:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-07-02 14:40:03 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project1',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'project1.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['project1.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-02 14:40:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-02 14:40:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-02 14:40:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-02 14:40:03 [scrapy.core.engine] INFO: Spider opened
2024-07-02 14:40:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-02 14:40:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-02 14:40:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)
2024-07-02 14:40:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)
2024-07-02 14:40:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>
{'美金 (USD)': '32.89', '港幣 (HKD)': '4.223', '英鎊 (GBP)': '42.15', '澳幣 (AUD)': '22.05', '加拿大幣 (CAD)': '24.15', '新加坡幣 (SGD)': '24.35', '瑞士法郎 (CHF)': '36.46', '日圓 (JPY)': '0.2053', '南非幣 (ZAR)': '-', '瑞典幣 (SEK)': '3.2', '紐元 (NZD)': '20.12', '泰幣 (THB)': '0.948', '菲國比索 (PHP)': '0.6208', '印尼幣 (IDR)': '0.00233', '歐元 (EUR)': '35.53', '韓元 (KRW)': '0.02566', ' 越南盾 (VND)': '0.00147', '馬來幣 (MYR)': '7.389', '人民幣 (CNY)': '4.529'}
2024-07-02 14:40:03 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-02 14:40:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 745,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 137995,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.550305,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 7, 2, 6, 40, 3, 909157, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 7, 2, 6, 40, 3, 358852, tzinfo=datetime.timezone.utc)}
2024-07-02 14:40:03 [scrapy.core.engine] INFO: Spider closed (finished)

The returned dictionary can also be written to a JSON file by appending the -o option and a file name:

scrapy crawl <spider name> -o <output.json>

D:\python\test\scrapy_projects\project1>scrapy crawl project1 -o data.json   
2024-07-02 15:07:33 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: project1)
2024-07-02 15:07:33 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.22631-SP0
2024-07-02 15:07:33 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-02 15:07:33 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-02 15:07:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-02 15:07:33 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-02 15:07:33 [scrapy.extensions.telnet] INFO: Telnet Password: cceb41c674adbeb8
2024-07-02 15:07:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-07-02 15:07:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project1',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'project1.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['project1.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-02 15:07:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-02 15:07:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-02 15:07:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-02 15:07:33 [scrapy.core.engine] INFO: Spider opened
2024-07-02 15:07:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-02 15:07:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-02 15:07:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)
2024-07-02 15:07:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)
2024-07-02 15:07:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>
{'美金 (USD)': '32.895', '港幣 (HKD)': '4.223', '英鎊 (GBP)': '42.14', '澳幣 (AUD)': '22.06', '加拿大幣 (CAD)': '24.15', '新加坡幣 (SGD)': '24.35', '瑞士法郎 (CHF)': '36.46', '日圓 (JPY)': '0.2053', '南非幣 (ZAR)': '-', '瑞典幣 (SEK)': '3.2', '紐元 (NZD)': '20.11', '泰幣 (THB)': '0.9481', '菲國比索 (PHP)': '0.6207', '印尼幣 (IDR)': '0.00233', '歐元 (EUR)': '35.52', '韓元 (KRW)': '0.02565', '越南盾 (VND)': '0.00147', '馬來幣 (MYR)': '7.388', '人民幣 (CNY)': '4.53'}
2024-07-02 15:07:34 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-02 15:07:34 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2024-07-02 15:07:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 745,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 137997,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.566149,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 7, 2, 7, 7, 34, 407354, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 11,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 7, 2, 7, 7, 33, 841205, tzinfo=datetime.timezone.utc)}
2024-07-02 15:07:34 [scrapy.core.engine] INFO: Spider closed (finished)

The run is much the same as before, just with an extra export step. The output file is written to the outer project directory (the same level as scrapy.cfg); opening data.json shows:

[
{"美金 (USD)": "32.895", "港幣 (HKD)": "4.223", "英鎊 (GBP)": "42.13", "澳幣 (AUD)": "22.06", "加拿大幣 (CAD)": "24.15", "新加坡幣 (SGD)": "24.35", "瑞士法郎 (CHF)": "36.45", "日圓 (JPY)": "0.2053", "南非幣 (ZAR)": "-", "瑞典幣 (SEK)": "3.2", "紐元 (NZD)": "20.11", "泰幣 (THB)": "0.9481", "菲國比索 (PHP)": "0.6207", "印尼幣 (IDR)": "0.00233", "歐元 (EUR)": "35.51", "韓元 (KRW)": "0.02565", "越南盾 (VND)": "0.00147", "馬來幣 (MYR)": "7.388", "人民幣 (CNY)": "4.529"}
]
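Two related points worth knowing (taken from Scrapy's feed-export behaviour in general, not from the run above): with -o, repeated runs append items to the same file, which for JSON can leave an invalid document, whereas -O overwrites the file; and the export format is inferred from the file extension, so other formats work too:

scrapy crawl project1 -O data.json     # overwrite data.json instead of appending
scrapy crawl project1 -o data.csv      # export as CSV (format inferred from the extension)
scrapy crawl project1 -o data.jl       # JSON Lines, one item per line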

Scrapy provides a systematic way to write crawlers, but before writing the spider itself you still need to experiment first with how to extract the target data.


6. The project settings file settings.py:

The inner project directory contains the settings file settings.py. Most of its settings are commented out by default; only the lines shown uncommented below are active:

# Scrapy settings for project1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "project1"  

SPIDER_MODULES = ["project1.spiders"]    
NEWSPIDER_MODULE = "project1.spiders"   


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "project1 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True    

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "project1.middlewares.Project1SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "project1.middlewares.Project1DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "project1.pipelines.Project1Pipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value   
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"   
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"      
FEED_EXPORT_ENCODING = "utf-8"   

By default Scrapy obeys a site's robots.txt rules (ROBOTSTXT_OBEY defaults to True) and will not crawl pages that the file disallows. Changing ROBOTSTXT_OBEY to False makes Scrapy ignore robots.txt.
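For reference, a few settings that are commonly adjusted in settings.py (the values below are illustrative assumptions, not recommendations derived from this test):

# in settings.py
ROBOTSTXT_OBEY = False    # ignore robots.txt (use with care)
DOWNLOAD_DELAY = 1        # wait 1 second between requests to the same site
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"   # send a browser-like User-Agent header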

A zip archive of the project1 test above is on GitHub:


For more on using BeautifulSoup to parse documents inside a Scrapy project, see:

