Thursday, July 4, 2024

Python Learning Notes: A twisted version problem when running Scrapy spiders on the Raspberry Pi 3

Last night I installed Anydesk on the Pi 3 at my home in Kaohsiung. Today I used it to open a remote desktop session and copied over the Scrapy spider I built in Win11 two days ago to try running it on the Pi.

Previous notes in this series:



1. Installing Scrapy:

pi@raspberrypi:~ $ pip3 install scrapy   
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting scrapy
  Downloading https://files.pythonhosted.org/packages/29/87/4129a19d4d56092ed0f63e0739832d145690981631d32f3ac6e438ec0d25/Scrapy-2.9.0-py2.py3-none-any.whl (277kB)
Collecting PyDispatcher>=2.0.5; platform_python_implementation == "CPython" (from scrapy)
  Downloading https://files.pythonhosted.org/packages/66/0e/9ee7bc0b48ec45d93b302fa2d787830dca4dc454d31a237faa5815995988/PyDispatcher-2.0.7-py3-none-any.whl
Collecting itemadapter>=0.1.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/3b/9e/e1a5a2882d5a3cbf9018d18102edc4cc34de8a207e6c5eb765784298fb48/itemadapter-0.8.0-py3-none-any.whl
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from scrapy) (40.8.0)
Collecting cryptography>=3.4.6 (from scrapy)
  Downloading https://www.piwheels.org/simple/cryptography/cryptography-42.0.8-cp37-cp37m-linux_armv7l.whl (1.4MB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/36/d9/b67d9f251a69037c79bac90f975c84696f5ca68045bd1b97e68804625757/parsel-1.8.1-py2.py3-none-any.whl
Requirement already satisfied: packaging in ./.local/lib/python3.7/site-packages (from scrapy) (21.3)
Collecting protego>=0.1.15 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/bc/16/14fd1ecdece2e1d87279fc09fbd2d55bae5fa033783c3547af631c74d718/Protego-0.3.0-py2.py3-none-any.whl
Collecting queuelib>=1.4.2 (from scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /simple/queuelib/
  Downloading https://files.pythonhosted.org/packages/06/1e/9e3bfb6a10253f5d95acfed9c5732f4abc2ef87bdf985594ddfb99d222da/queuelib-1.6.2-py2.py3-none-any.whl
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/2a/e3/9fe9cf016d32d050a2eec518c2f5156f7623b42e1ef3f2fa3e80c0ef654c/twisted-23.8.0-py3-none-any.whl (3.1MB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/82/e2/dcf8573d7153194eb673347cea1f9bbdb2a8e61030740fb6f50e4234a00b/w3lib-2.1.2-py3-none-any.whl
Collecting cssselect>=0.9.1 (from scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /simple/cssselect/
  Downloading https://files.pythonhosted.org/packages/06/a9/2da08717a6862c48f1d61ef957a7bba171e7eefa6c0aa0ceb96a140c2a6b/cssselect-1.2.0-py2.py3-none-any.whl
Collecting itemloaders>=1.0.1 (from scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /simple/itemloaders/
  Downloading https://files.pythonhosted.org/packages/9a/98/d03afe36d01b1adf03435bb306d0e7f87d498c94ba8db5290f716b350bb8/itemloaders-1.1.0-py3-none-any.whl
Collecting tldextract (from scrapy)
  Downloading https://files.pythonhosted.org/packages/38/38/ef579e09695b406075370265e3f74030a6190556351e3af23c2f402c9a41/tldextract-4.0.0-py3-none-any.whl (97kB)
Collecting zope.interface>=5.1.0 (from scrapy)
  Downloading https://www.piwheels.org/simple/zope-interface/zope.interface-6.4.post2-cp37-cp37m-linux_armv7l.whl (218kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/93/5a/5e93f280ec7be676b5a57f305350f439d31ced168bca04e6ffa64b575664/service_identity-21.1.0-py2.py3-none-any.whl
Requirement already satisfied: lxml>=4.3.0 in ./.local/lib/python3.7/site-packages (from scrapy) (4.9.1)
Collecting pyOpenSSL>=21.0.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/54/a7/2104f674a5a6845b04c8ff01659becc6b8978ca410b82b94287e0b1e018b/pyOpenSSL-24.1.0-py3-none-any.whl (56kB)
Collecting cffi>=1.12; platform_python_implementation != "PyPy" (from cryptography>=3.4.6->scrapy)
  Downloading https://www.piwheels.org/simple/cffi/cffi-1.15.1-cp37-cp37m-linux_armv7l.whl (219kB)
Requirement already satisfied: typing-extensions; python_version < "3.8" in ./.local/lib/python3.7/site-packages (from parsel>=1.5.0->scrapy) (4.7.1)
Collecting jmespath (from parsel>=1.5.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/31/b4/b9b800c45527aadd64d5b442f9b932b00648617eb5d63d2c7a6587b7cafc/jmespath-1.0.1-py3-none-any.whl
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in ./.local/lib/python3.7/site-packages (from packaging->scrapy) (3.0.9)
Collecting incremental>=22.10.0 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/77/51/8073577012492fcd15628e811db585f447c500fa407e944ab3a18ec55fb7/incremental-22.10.0-py2.py3-none-any.whl
Collecting hyperlink>=17.1.1 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/6e/aa/8caf6a0a3e62863cbb9dab27135660acba46903b703e224f14f447e57934/hyperlink-21.0.0-py2.py3-none-any.whl (74kB)
Requirement already satisfied: attrs>=21.3.0 in ./.local/lib/python3.7/site-packages (from Twisted>=18.9.0->scrapy) (23.2.0)
Collecting zope-interface>=5 (from Twisted>=18.9.0->scrapy)
  Using cached https://www.piwheels.org/simple/zope-interface/zope.interface-6.4.post2-cp37-cp37m-linux_armv7l.whl
Collecting automat>=0.8.0 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/29/90/64aabce6c1b820395452cc5472b8f11cd98320f40941795b8069aef4e0e0/Automat-22.10.0-py2.py3-none-any.whl
Collecting constantly>=15.1 (from Twisted>=18.9.0->scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /simple/constantly/
  Downloading https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl
Requirement already satisfied: idna in /usr/lib/python3/dist-packages (from tldextract->scrapy) (2.6)
Collecting filelock>=3.0.8 (from tldextract->scrapy)
  Downloading https://files.pythonhosted.org/packages/00/45/ec3407adf6f6b5bf867a4462b2b0af27597a26bd3cd6e2534cb6ab029938/filelock-3.12.2-py3-none-any.whl
Collecting requests-file>=1.4 (from tldextract->scrapy)
  Downloading https://files.pythonhosted.org/packages/d7/25/dd878a121fcfdf38f52850f11c512e13ec87c2ea72385933818e5b6c15ce/requests_file-2.1.0-py2.py3-none-any.whl
Requirement already satisfied: requests>=2.1.0 in ./.local/lib/python3.7/site-packages (from tldextract->scrapy) (2.31.0)
Collecting pyasn1-modules (from service-identity>=18.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/cd/8e/bea464350e1b8c6ed0da3a312659cb648804a08af6cacc6435867f74f8bd/pyasn1_modules-0.3.0-py2.py3-none-any.whl (181kB)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from service-identity>=18.1.0->scrapy) (1.12.0)
Collecting pyasn1 (from service-identity>=18.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/d1/75/4686d2872bf2fc0b37917cbc8bbf0dd3a5cdb0990799be1b9cbf1e1eb733/pyasn1-0.5.1-py2.py3-none-any.whl (84kB)
Collecting pycparser (from cffi>=1.12; platform_python_implementation != "PyPy"->cryptography>=3.4.6->scrapy)
  Downloading https://files.pythonhosted.org/packages/62/d5/5f610ebe421e85889f2e55e33b7f9a6795bd982198517d912eb1c76e1a53/pycparser-2.21-py2.py3-none-any.whl (118kB)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in ./.local/lib/python3.7/site-packages (from attrs>=21.3.0->Twisted>=18.9.0->scrapy) (6.7.0)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./.local/lib/python3.7/site-packages (from requests>=2.1.0->tldextract->scrapy) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in ./.local/lib/python3.7/site-packages (from requests>=2.1.0->tldextract->scrapy) (2024.2.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./.local/lib/python3.7/site-packages (from requests>=2.1.0->tldextract->scrapy) (2.1.1)
Requirement already satisfied: zipp>=0.5 in ./.local/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->attrs>=21.3.0->Twisted>=18.9.0->scrapy) (3.15.0)
Installing collected packages: PyDispatcher, itemadapter, pycparser, cffi, cryptography, jmespath, cssselect, w3lib, parsel, protego, queuelib, incremental, hyperlink, zope-interface, automat, constantly, Twisted, itemloaders, filelock, requests-file, tldextract, zope.interface, pyasn1, pyasn1-modules, service-identity, pyOpenSSL, scrapy
Successfully installed PyDispatcher-2.0.7 Twisted-23.8.0 automat-22.10.0 cffi-1.15.1 constantly-15.1.0 cryptography-42.0.8 cssselect-1.2.0 filelock-3.12.2 hyperlink-21.0.0 incremental-22.10.0 itemadapter-0.8.0 itemloaders-1.1.0 jmespath-1.0.1 parsel-1.8.1 protego-0.3.0 pyOpenSSL-24.1.0 pyasn1-0.5.1 pyasn1-modules-0.3.0 pycparser-2.21 queuelib-1.6.2 requests-file-2.1.0 scrapy-2.9.0 service-identity-21.1.0 tldextract-4.0.0 w3lib-2.1.2 zope-interface zope.interface-6.4.post2

pi@raspberrypi:~ $ python3   
Python 3.7.3 (default, Jan 22 2021, 20:04:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy   
>>> scrapy.__version__    
'2.9.0'

As you can see, the installed Scrapy is not the latest release but v2.9.0, presumably the newest version whose requirements the Pi's Python 3.7 still satisfies.

You can also run the scrapy command directly in the terminal:

pi@raspberrypi:~ $ scrapy   
Scrapy 2.9.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command


2. Creating a Scrapy project:

First create a scrapy_projects directory, then use the scrapy startproject command inside it to create a project named project1:

pi@raspberrypi:~ $ mkdir scrapy_projects    
pi@raspberrypi:~ $ cd scrapy_projects  
pi@raspberrypi:~/scrapy_projects $ scrapy startproject project1   
New Scrapy project 'project1', using template directory '/home/pi/.local/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /home/pi/scrapy_projects/project1

You can start your first spider with:
    cd project1
    scrapy genspider example example.com

Use the tree command to display the project's directory structure:

pi@raspberrypi:~/scrapy_projects $ tree project1   
project1
├── project1
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files

This is identical to the layout on Windows: items.py defines the item classes, middlewares.py the middleware, pipelines.py the item pipelines, settings.py the project settings, and spiders/ holds the spider modules.


3. Running the Bank of Taiwan exchange-rate spider:

I downloaded the Bank of Taiwan exchange-rate spider project from the previous post from GitHub onto the Raspberry Pi, overwriting the project1 created above.


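Since the GitHub link may not reproduce here, below is a minimal sketch of what such a spider might look like, reconstructed from the crawl output further down. The class name and CSS selectors are illustrative guesses, not the actual project code:

# project1/spiders/project1.py : a sketch only, NOT the real project code.
# The selectors are guesses about the structure of the rate.bot.com.tw page.
import scrapy

class RateSpider(scrapy.Spider):
    name = 'project1'                   # matches the `scrapy crawl project1` command below
    allowed_domains = ['rate.bot.com.tw']
    start_urls = ['https://rate.bot.com.tw/xrt?Lang=zh-TW']

    def parse(self, response):
        rates = {}
        for row in response.css('table tbody tr'):   # one row per currency (guessed selector)
            currency = row.css('div.visible-phone::text').get(default='').strip()
            rate = row.css('td::text').get(default='-').strip()
            if currency:
                rates[currency] = rate
        yield rates                     # a single dict per crawl, as in the log output below
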
After unpacking it, running scrapy crawl project1 produced the following error:

pi@raspberrypi:~/scrapy_projects/project1 $ scrapy crawl project1   
2024-07-05 09:55:37 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: project1)
2024-07-05 09:55:37 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 23.8.0, Python 3.7.3 (default, Jan 22 2021, 20:04:44) - [GCC 8.3.0], pyOpenSSL 24.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 42.0.8, Platform Linux-5.10.17-v7+-armv7l-with-debian-10.9
2024-07-05 09:55:37 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project1',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'project1.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['project1.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-05 09:55:37 [asyncio] DEBUG: Using selector: EpollSelector
2024-07-05 09:55:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-05 09:55:37 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-07-05 09:55:37 [scrapy.extensions.telnet] INFO: Telnet Password: edd3164037861353
2024-07-05 09:55:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-07-05 09:55:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-05 09:55:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-05 09:55:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-05 09:55:39 [scrapy.core.engine] INFO: Spider opened
2024-07-05 09:55:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 09:55:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "/home/pi/.local/bin/scrapy", line 10, in <module>
    sys.exit(execute())
  File "/home/pi/.local/lib/python3.7/site-packages/scrapy/cmdline.py", line 158, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/pi/.local/lib/python3.7/site-packages/scrapy/cmdline.py", line 111, in _run_print_help
    func(*a, **kw)
  File "/home/pi/.local/lib/python3.7/site-packages/scrapy/cmdline.py", line 166, in _run_command
    cmd.run(args, opts)
  File "/home/pi/.local/lib/python3.7/site-packages/scrapy/commands/crawl.py", line 30, in run
    self.crawler_process.start()
  File "/home/pi/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 383, in start
    install_shutdown_handlers(self._signal_shutdown)
  File "/home/pi/.local/lib/python3.7/site-packages/scrapy/utils/ossignal.py", line 19, in install_shutdown_handlers
    reactor._handleSignals()
AttributeError: 'AsyncioSelectorReactor' object has no attribute '_handleSignals'   

A Google search turned up the following post:


It turns out to be a twisted compatibility bug: the 23.8.0 installed above is too new for Scrapy 2.9.0 (as the traceback shows, Scrapy 2.9 still calls the private reactor._handleSignals() method, which Twisted 23.8.0 apparently removed), so Twisted v22.10.0 has to be installed instead:

pi@raspberrypi:~/scrapy_projects/project1 $ pip3 install Twisted==22.10.0   
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting Twisted==22.10.0
  Downloading https://files.pythonhosted.org/packages/ac/63/b5540d15dfeb7388fbe12fa55a902c118fd2b324be5430cdeac0c0439489/Twisted-22.10.0-py3-none-any.whl (3.1MB)
Requirement already satisfied: typing-extensions>=3.6.5 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (4.7.1)
Requirement already satisfied: incremental>=21.3.0 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (22.10.0)
Requirement already satisfied: hyperlink>=17.1.1 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (21.0.0)
Requirement already satisfied: constantly>=15.1 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (15.1.0)
Requirement already satisfied: zope.interface>=4.4.2 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (6.4.post2)
Requirement already satisfied: attrs>=19.2.0 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (23.2.0)
Requirement already satisfied: Automat>=0.8.0 in /home/pi/.local/lib/python3.7/site-packages (from Twisted==22.10.0) (22.10.0)
Requirement already satisfied: idna>=2.5 in /usr/lib/python3/dist-packages (from hyperlink>=17.1.1->Twisted==22.10.0) (2.6)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from zope.interface>=4.4.2->Twisted==22.10.0) (40.8.0)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /home/pi/.local/lib/python3.7/site-packages (from attrs>=19.2.0->Twisted==22.10.0) (6.7.0)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from Automat>=0.8.0->Twisted==22.10.0) (1.12.0)
Requirement already satisfied: zipp>=0.5 in /home/pi/.local/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->attrs>=19.2.0->Twisted==22.10.0) (3.15.0)
Installing collected packages: Twisted
  Found existing installation: twisted 23.8.0
    Uninstalling twisted-23.8.0:
      Successfully uninstalled twisted-23.8.0
Successfully installed Twisted-22.10.0
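
Incidentally, a small pre-flight script can catch this bad pairing before launching a crawl. This is only a sketch: the Scrapy 2.11 boundary is inferred from my own observations below (2.11.2 works with a newer Twisted, 2.9.0 does not), not from an official compatibility table:

# check_pairing.py : warn about the Scrapy/Twisted combination that crashed above.
# The version boundary (Scrapy < 2.11 vs Twisted >= 23.8) is an inference, not official.
import pkg_resources                    # bundled with setuptools, works on Python 3.7
from packaging.version import Version   # packaging is already a Scrapy dependency

scrapy_v = Version(pkg_resources.get_distribution('Scrapy').version)
twisted_v = Version(pkg_resources.get_distribution('Twisted').version)

if scrapy_v < Version('2.11') and twisted_v >= Version('23.8'):
    print('Scrapy %s + Twisted %s will crash with the _handleSignals '
          'AttributeError; pin Twisted==22.10.0 or upgrade Scrapy.'
          % (scrapy_v, twisted_v))
else:
    print('Scrapy %s + Twisted %s should be fine.' % (scrapy_v, twisted_v))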

Running the spider again, it now works:

pi@raspberrypi:~/scrapy_projects/project1 $ scrapy crawl project1   
2024-07-05 10:04:20 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: project1)
2024-07-05 10:04:20 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.7.3 (default, Jan 22 2021, 20:04:44) - [GCC 8.3.0], pyOpenSSL 24.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 42.0.8, Platform Linux-5.10.17-v7+-armv7l-with-debian-10.9
2024-07-05 10:04:20 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project1',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'project1.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['project1.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-05 10:04:20 [asyncio] DEBUG: Using selector: EpollSelector
2024-07-05 10:04:20 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-05 10:04:20 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-07-05 10:04:20 [scrapy.extensions.telnet] INFO: Telnet Password: f8a4f93f94072d0c
2024-07-05 10:04:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-07-05 10:04:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-05 10:04:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-05 10:04:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-05 10:04:22 [scrapy.core.engine] INFO: Spider opened
2024-07-05 10:04:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 10:04:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-05 10:04:23 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): publicsuffix.org:443
2024-07-05 10:04:24 [urllib3.connectionpool] DEBUG: https://publicsuffix.org:443 "GET /list/public_suffix_list.dat HTTP/1.1" 200 86861
2024-07-05 10:04:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)
2024-07-05 10:04:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)
2024-07-05 10:04:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>
{'美金 (USD)': '32.705', '港幣 (HKD)': '4.201', '英鎊 (GBP)': '42.32', '澳幣 (AUD)': '22.23', '加拿大幣 (CAD)': '24.25', '新加坡幣 (SGD)': '24.34', '瑞士法郎 (CHF)': '36.45', '日圓 (JPY)': '0.205', '南非幣 (ZAR)': '-', '瑞典幣 (SEK)': '3.22', '紐元 (NZD)': '20.23', '泰幣 (THB)': '0.9484', '菲國比索 (PHP)': '0.6199', '印尼幣 (IDR)': '0.00233', '歐元 (EUR)': '35.6', '韓元 (KRW)': '0.02568', '越南盾 (VND)': '0.00146', '馬來幣 (MYR)': '7.368', '人民幣 (CNY)': '4.512'}
2024-07-05 10:04:26 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-05 10:04:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 743,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 137997,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.652907,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 7, 5, 2, 4, 26, 284798),
 'item_scraped_count': 1,
 'log_count/DEBUG': 8,
 'log_count/INFO': 10,
 'memusage/max': 45330432,
 'memusage/startup': 45330432,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 7, 5, 2, 4, 23, 631891)}
2024-07-05 10:04:26 [scrapy.core.engine] INFO: Spider closed (finished)

It looks like the Scrapy-2.11.2 and Twisted-24.3.0 combination installed on Windows works fine, but on the Raspberry Pi Scrapy-2.9.0 does not work with Twisted-23.8.0 and must be paired with Twisted-22.10.0.
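
To avoid tripping over this again when setting up another Pi, the known-good pair can be pinned in a requirements.txt (assuming these versions remain available on PyPI/piwheels):

# requirements.txt for the Pi 3 (Python 3.7)
Scrapy==2.9.0
Twisted==22.10.0

and installed in one step with pip3 install -r requirements.txt.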

The following run writes the result to the file data.json instead (with this Scrapy version, -o appends to an existing file while the capital -O variant overwrites it):

pi@raspberrypi:~/scrapy_projects/project1 $ scrapy crawl project1 -o data.json   
2024-07-05 10:36:14 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: project1)
2024-07-05 10:36:14 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.7.3 (default, Jan 22 2021, 20:04:44) - [GCC 8.3.0], pyOpenSSL 24.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 42.0.8, Platform Linux-5.10.17-v7+-armv7l-with-debian-10.9
2024-07-05 10:36:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project1',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'project1.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['project1.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-05 10:36:14 [asyncio] DEBUG: Using selector: EpollSelector
2024-07-05 10:36:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-05 10:36:14 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-07-05 10:36:14 [scrapy.extensions.telnet] INFO: Telnet Password: 7a0555dee0ea0368
2024-07-05 10:36:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-07-05 10:36:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-05 10:36:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-05 10:36:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-05 10:36:16 [scrapy.core.engine] INFO: Spider opened
2024-07-05 10:36:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 10:36:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-05 10:36:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/robots.txt> (referer: None)
2024-07-05 10:36:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rate.bot.com.tw/xrt?Lang=zh-TW> (referer: None)
2024-07-05 10:36:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rate.bot.com.tw/xrt?Lang=zh-TW>
{'美金 (USD)': '32.725', '港幣 (HKD)': '4.204', '英鎊 (GBP)': '42.35', '澳幣 (AUD)': '22.25', '加拿大幣 (CAD)': '24.26', '新加坡幣 (SGD)': '24.36', '瑞士法郎 (CHF)': '36.48', '日圓 (JPY)': '0.2052', '南非幣 (ZAR)': '-', '瑞典幣 (SEK)': '3.23', '紐元 (NZD)': '20.23', '泰幣 (THB)': '0.9492', '菲國比索 (PHP)': '0.6201', '印尼幣 (IDR)': '0.00233', '歐元 (EUR)': '35.63', '韓元 (KRW)': '0.02569', '越南盾 (VND)': '0.00146', '馬來幣 (MYR)': '7.372', '人民幣 (CNY)': '4.515'}
2024-07-05 10:36:19 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-05 10:36:19 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2024-07-05 10:36:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 743,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 138002,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.056378,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 7, 5, 2, 36, 19, 424326),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 11,
 'memusage/max': 45248512,
 'memusage/startup': 45248512,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 7, 5, 2, 36, 17, 367948)}
2024-07-05 10:36:19 [scrapy.core.engine] INFO: Spider closed (finished)
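
Instead of passing -o on every run, the same export can be configured once in the project's settings.py via the FEEDS setting; a sketch (data.json is just this example's file name):

# project1/settings.py (sketch) : equivalent of `scrapy crawl project1 -O data.json`
FEEDS = {
    'data.json': {
        'format': 'json',
        'encoding': 'utf8',
        'overwrite': True,   # behave like -O; use False to append like -o
    },
}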

Opening data.json with nano shows the following (the trailing $ is nano marking a line that extends past the screen width):

pi@raspberrypi:~/scrapy_projects/project1 $ nano data.json   

[
{"美金 (USD)": "32.725", "港幣 (HKD)": "4.204", "英鎊 (GBP)": "42.35", "澳幣 (AUD)": "22.25", "加拿大幣 (CAD)": "24.26", "新加坡$
]
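
Since FEED_EXPORT_ENCODING is utf-8, the exported feed reads straight back into Python; a quick sketch:

# read_rates.py : load the exported feed (a JSON array of scraped items)
import json

with open('data.json', encoding='utf-8') as f:
    items = json.load(f)

rates = items[0]                # this crawl yields exactly one item dict
print(rates['美金 (USD)'])      # -> 32.725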

OK! From now on I can deploy Scrapy spider projects on the Raspberry Pi as well.
