Wednesday, March 16, 2022

SpaCy Study Notes (1): Installing SpaCy and Language Models

SpaCy is an industrial-grade Python natural language processing (NLP) package that provides a wide range of pre-trained language models and a clean NLP interface. Its main developers are Matthew Honnibal and Ines Montani, the two founders of the software company Explosion, and the source code is released under the MIT license. SpaCy 3.0, released in early 2021, added transformer-based NLP task pipelines and dropped support for Python 2. SpaCy is implemented in Cython, so it runs very fast. Reference:


The difference between SpaCy and the NLTK package is that NLTK is aimed mainly at teaching and research, whereas SpaCy targets real-world, industrial-strength applications (chatbots, for example). SpaCy also supports deep learning: using its own machine learning library Thinc as a backend, it can hook into statistical models trained with frameworks such as TensorFlow, PyTorch, and MXNet, and it uses convolutional neural networks (CNNs) for part-of-speech (POS) tagging, dependency parsing, text classification, and named entity recognition (NER). Its pre-built statistical neural network models can perform these NLP tasks for 17 languages, and for tokenization SpaCy supports more than 65 languages, letting users train customized models on their own datasets. In addition, SpaCy ships with advanced NLP features such as word vectors built in, whereas NLTK relies on third-party packages such as Gensim for them.

Reference:


There are still relatively few books about SpaCy; the following are the main titles:



(Book cover images omitted; source: 天瓏)

1. Installing SpaCy:

I actually installed SpaCy (version 3.1.1) on my laptop back on August 3 last year, but had no time to test it then. Here is the installation log I found from that day; on Windows 10 a simple online pip install is all that is needed:

C:\Users\User>pip install spacy     
Collecting spacy
  Downloading spacy-3.1.1-cp37-cp37m-win_amd64.whl (11.8 MB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.5-cp37-cp37m-win_amd64.whl (108 kB)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4 in c:\python37\lib\site-packages (from spacy) (3.7.4.3)
Requirement already satisfied: packaging>=20.0 in c:\python37\lib\site-packages (from spacy) (20.7)
Requirement already satisfied: jinja2 in c:\python37\lib\site-packages (from spacy) (2.10.1)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.1-cp37-cp37m-win_amd64.whl (450 kB)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\python37\lib\site-packages (from spacy) (2.21.0)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp37-cp37m-win_amd64.whl (6.5 MB)
Collecting thinc<8.1.0,>=8.0.8
  Downloading thinc-8.0.8-cp37-cp37m-win_amd64.whl (1.0 MB)
Collecting spacy-legacy<3.1.0,>=3.0.7
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Requirement already satisfied: setuptools in c:\python37\lib\site-packages (from spacy) (51.0.0)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp37-cp37m-win_amd64.whl (35 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.5-cp37-cp37m-win_amd64.whl (20 kB)
Requirement already satisfied: numpy>=1.15.0 in c:\python37\lib\site-packages (from spacy) (1.19.4)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\python37\lib\site-packages (from spacy) (4.54.1)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-win_amd64.whl (1.9 MB)
Collecting catalogue<2.1.0,>=2.0.4
  Downloading catalogue-2.0.4-py3-none-any.whl (16 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.0-py3-none-any.whl (42 kB)
Requirement already satisfied: zipp>=0.5 in c:\python37\lib\site-packages (from catalogue<2.1.0,>=2.0.4->spacy) (3.1.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\python37\lib\site-packages (from packaging>=20.0->spacy) (2.3.1)
Collecting smart-open<6.0.0,>=5.0.0
  Downloading smart_open-5.1.0-py3-none-any.whl (57 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2018.11.29)
Requirement already satisfied: idna<2.9,>=2.5 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.1)
Collecting click<7.2.0,>=7.1.1
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Requirement already satisfied: MarkupSafe>=0.23 in c:\python37\lib\site-packages (from jinja2->spacy) (1.1.1)
Installing collected packages: murmurhash, cymem, click, catalogue, wasabi, typer, srsly, smart-open, pydantic, preshed, blis, thinc, spacy-legacy, pathy, spacy
  Attempting uninstall: click
    Found existing installation: Click 7.0
    Uninstalling Click-7.0:
      Successfully uninstalled Click-7.0
  Attempting uninstall: smart-open
    Found existing installation: smart-open 4.0.1
    Uninstalling smart-open-4.0.1:
      Successfully uninstalled smart-open-4.0.1
Successfully installed blis-0.7.4 catalogue-2.0.4 click-7.1.2 cymem-2.0.5 murmurhash-1.0.5 pathy-0.6.0 preshed-3.0.5 pydantic-1.8.2 smart-open-5.1.0 spacy-3.1.1 spacy-legacy-3.0.8 srsly-2.4.1 thinc-8.0.8 typer-0.3.2 wasabi-0.8.2

As you can see, SpaCy pulls in a large number of dependencies, so an offline installation would be troublesome because of the many files to download. After installation, starting the Python interpreter and importing spacy may print a "Could not load dynamic library 'cudart64_101.dll'" error; this is because my laptop has no GPU, but it does not affect the use of SpaCy at all:

C:\Users\User>python    
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy     
2021-08-03 19:21:43.919588: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-08-03 19:21:43.921914: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.  
>>>
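
If the TensorFlow warning is distracting, it can usually be silenced by raising TensorFlow's log level before importing spacy. This is a small sketch of my own (TF_CPP_MIN_LOG_LEVEL is a TensorFlow environment variable, not something SpaCy provides), and it only hides the message; SpaCy's behaviour does not change:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'   # hide TensorFlow INFO and WARNING messages
import spacy                               # the cudart64 warning should no longer be printed
print(spacy.__version__)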

More than half a year later, SpaCy had already moved on to v3.2.3, so I used the -U flag to upgrade straight to the latest version:

C:\Users\User>pip install -U spacy   
Requirement already satisfied: spacy in c:\python37\lib\site-packages (3.1.1)
Collecting spacy
  Downloading spacy-3.2.3-cp37-cp37m-win_amd64.whl (11.5 MB)
     ---------------------------------------- 11.5/11.5 MB 3.5 MB/s eta 0:00:00
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\python37\lib\site-packages (from spacy) (1.8.2)
Requirement already satisfied: packaging>=20.0 in c:\python37\lib\site-packages (from spacy) (20.7)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in c:\python37\lib\site-packages (from spacy) (0.3.2)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\python37\lib\site-packages (from spacy) (4.54.1)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\python37\lib\site-packages (from spacy) (0.7.4)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\python37\lib\site-packages (from spacy) (3.0.5)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\python37\lib\site-packages (from spacy) (2.0.5)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\python37\lib\site-packages (from spacy) (1.0.5)
Requirement already satisfied: setuptools in c:\python37\lib\site-packages (from spacy) (51.0.0)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\python37\lib\site-packages (from spacy) (2.4.1)
Requirement already satisfied: numpy>=1.15.0 in c:\python37\lib\site-packages (from spacy) (1.19.4)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\python37\lib\site-packages (from spacy) (0.8.2)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-win_amd64.whl (1.0 MB)
     ---------------------------------------- 1.0/1.0 MB 3.0 MB/s eta 0:00:00
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in c:\python37\lib\site-packages (from spacy) (3.0.8)
Requirement already satisfied: jinja2 in c:\python37\lib\site-packages (from spacy) (2.10.1)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
     ---------------------------------------- 181.6/181.6 KB 2.2 MB/s eta 0:00:00
Requirement already satisfied: pathy>=0.3.5 in c:\python37\lib\site-packages (from spacy) (0.6.0)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4 in c:\python37\lib\site-packages (from spacy) (3.7.4.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\python37\lib\site-packages (from spacy) (2.21.0)
Requirement already satisfied: zipp>=0.5 in c:\python37\lib\site-packages (from catalogue<2.1.0,>=2.0.6->spacy) (3.1.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\python37\lib\site-packages (from packaging>=20.0->spacy) (2.3.1)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in c:\python37\lib\site-packages (from pathy>=0.3.5->spacy) (5.1.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2018.11.29)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.1)
Requirement already satisfied: idna<2.9,>=2.5 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.8)
Requirement already satisfied: click<7.2.0,>=7.1.1 in c:\python37\lib\site-packages (from typer<0.5.0,>=0.3.0->spacy) (7.1.2)
Requirement already satisfied: MarkupSafe>=0.23 in c:\python37\lib\site-packages (from jinja2->spacy) (1.1.1)
Installing collected packages: spacy-loggers, langcodes, catalogue, thinc, spacy
  Attempting uninstall: catalogue
    Found existing installation: catalogue 2.0.4
    Uninstalling catalogue-2.0.4:
      Successfully uninstalled catalogue-2.0.4
  Attempting uninstall: thinc
    Found existing installation: thinc 8.0.8
    Uninstalling thinc-8.0.8:
      Successfully uninstalled thinc-8.0.8
  Attempting uninstall: spacy
    Found existing installation: spacy 3.1.1
    Uninstalling spacy-3.1.1:
      Successfully uninstalled spacy-3.1.1
Successfully installed catalogue-2.0.6 langcodes-3.3.0 spacy-3.2.3 spacy-loggers-1.0.1 thinc-8.0.13

Once SpaCy is installed, the version can be checked with the following command:

C:\Users\User>python -m spacy info    
2022-03-15 15:29:33.665770: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2022-03-15 15:29:33.666522: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

============================== Info about spaCy ==============================

spaCy version    3.2.3
Location         C:\Python37\lib\site-packages\spacy
Platform         Windows-10-10.0.19041-SP0
Python version   3.7.2
Pipelines

You can also start the Python interpreter and check spacy.__version__:

C:\Users\User>python    
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy     
2022-03-15 08:47:12.656539: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2022-03-15 08:47:12.657791: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
>>> spacy.__version__    
'3.2.3'

Below is the result of installing it on a Raspberry Pi:

pi@raspberrypi:~ $ pip3 install spacy   
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting spacy
  Downloading https://www.piwheels.org/simple/spacy/spacy-3.1.1-cp37-cp37m-linux_armv7l.whl (23.8MB)
Collecting srsly<3.0.0,>=2.4.1 (from spacy)
  Downloading https://www.piwheels.org/simple/srsly/srsly-2.4.1-cp37-cp37m-linux_armv7l.whl (794kB)
Collecting spacy-legacy<3.1.0,>=3.0.7 (from spacy)
  Downloading https://files.pythonhosted.org/packages/d3/e8/1bc00eeff3faf1c50bde941f88a491a5c1128debb75dd8c913401e71585c/spacy_legacy-3.0.8-py2.py3-none-any.whl
Collecting catalogue<2.1.0,>=2.0.4 (from spacy)
  Downloading https://files.pythonhosted.org/packages/9c/10/dbc1203a4b1367c7b02fddf08cb2981d9aa3e688d398f587cea0ab9e3bec/catalogue-2.0.4-py3-none-any.whl
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading https://www.piwheels.org/simple/murmurhash/murmurhash-1.0.5-cp37-cp37m-linux_armv7l.whl (80kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading https://www.piwheels.org/simple/preshed/preshed-3.0.5-cp37-cp37m-linux_armv7l.whl (556kB)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from spacy) (40.8.0)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4; python_version < "3.8" in ./.local/lib/python3.7/site-packages (from spacy) (3.10.0.0)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading https://www.piwheels.org/simple/cymem/cymem-2.0.5-cp37-cp37m-linux_armv7l.whl (143kB)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/lib/python3/dist-packages (from spacy) (2.21.0)
Collecting typer<0.4.0,>=0.3.0 (from spacy)
  Downloading https://files.pythonhosted.org/packages/90/34/d138832f6945432c638f32137e6c79a3b682f06a63c488dcfaca6b166c64/typer-0.3.2-py3-none-any.whl
Collecting wasabi<1.1.0,>=0.8.1 (from spacy)
  Downloading https://files.pythonhosted.org/packages/a6/1d/d281571b4c3b20fff183b485c6673c62878727119a849c7932651a8b5060/wasabi-0.8.2-py3-none-any.whl
Collecting tqdm<5.0.0,>=4.38.0 (from spacy)
  Downloading https://files.pythonhosted.org/packages/0b/e8/d6f4db0886dbba2fc87b5314f2d5127acdc782e4b51e6f86972a2e45ffd6/tqdm-4.62.0-py2.py3-none-any.whl (76kB)
Collecting packaging>=20.0 (from spacy)
  Downloading https://files.pythonhosted.org/packages/3c/77/e2362b676dc5008d81be423070dd9577fa03be5da2ba1105811900fda546/packaging-21.0-py3-none-any.whl (40kB)
Collecting pathy>=0.3.5 (from spacy)
  Downloading https://files.pythonhosted.org/packages/65/ae/ecfa3e2dc267010fa320034be0eb3a8e683dc98dae7e70f92b41605b4d35/pathy-0.6.0-py3-none-any.whl (42kB)
Requirement already satisfied: numpy>=1.15.0 in /usr/lib/python3/dist-packages (from spacy) (1.16.2)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 (from spacy)
  Downloading https://files.pythonhosted.org/packages/ff/74/54e030641601112309f6d2af620774e9080f99c7a15742fc6a0b170c4076/pydantic-1.8.2-py3-none-any.whl (126kB)
Requirement already satisfied: jinja2 in /usr/lib/python3/dist-packages (from spacy) (2.10)
Collecting thinc<8.1.0,>=8.0.8 (from spacy)
  Downloading https://www.piwheels.org/simple/thinc/thinc-8.0.8-cp37-cp37m-linux_armv7l.whl (1.8MB)
Collecting blis<0.8.0,>=0.4.0 (from spacy)
  Downloading https://www.piwheels.org/simple/blis/blis-0.7.4-cp37-cp37m-linux_armv7l.whl (2.2MB)
Collecting zipp>=0.5; python_version < "3.8" (from catalogue<2.1.0,>=2.0.4->spacy)
  Downloading https://files.pythonhosted.org/packages/92/d9/89f433969fb8dc5b9cbdd4b4deb587720ec1aeb59a020cf15002b9593eef/zipp-3.5.0-py3-none-any.whl
Collecting click<7.2.0,>=7.1.1 (from typer<0.4.0,>=0.3.0->spacy)
  Downloading https://files.pythonhosted.org/packages/d2/3d/fa76db83bf75c4f8d338c2fd15c8d33fdd7ad23a9b5e57eb6c5de26b430e/click-7.1.2-py2.py3-none-any.whl (82kB)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/lib/python3/dist-packages (from packaging>=20.0->spacy) (2.2.0)
Collecting smart-open<6.0.0,>=5.0.0 (from pathy>=0.3.5->spacy)
  Downloading https://files.pythonhosted.org/packages/e9/90/6ca525991e281ecdf204c5c1de854da6334068e44121c384b68c6a838e14/smart_open-5.1.0-py3-none-any.whl (57kB)
Installing collected packages: zipp, catalogue, srsly, spacy-legacy, murmurhash, cymem, preshed, click, typer, wasabi, tqdm, packaging, smart-open, pathy, pydantic, blis, thinc, spacy
  The script tqdm is installed in '/home/pi/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  The script pathy is installed in '/home/pi/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  The script spacy is installed in '/home/pi/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed blis-0.7.4 catalogue-2.0.4 click-7.1.2 cymem-2.0.5 murmurhash-1.0.5 packaging-21.0 pathy-0.6.0 preshed-3.0.5 pydantic-1.8.2 smart-open-5.1.0 spacy-3.1.1 spacy-legacy-3.0.8 srsly-2.4.1 thinc-8.0.8 tqdm-4.62.0 typer-0.3.2 wasabi-0.8.2 zipp-3.5.0


2. Installing a Specific Statistical Language Model:

The SpaCy package installed above does not come with the statistical language models needed to run NLP pipeline tasks; those are installed separately as required. A SpaCy language model is a collection of knowledge about a particular language gathered from various sources, enabling NLP tasks such as tokenization, part-of-speech (POS) tagging, and named entity recognition (NER).

For each language, SpaCy offers pre-trained statistical models in several sizes, named using the format "language_type_genre_size". Taking the English core type (core: a general-purpose model covering vocabulary, syntax, named entities, and so on) trained on the web genre (written web text such as blogs, news, and online comments) as an example, there are the following four models:
  • en_core_web_sm : small model (14 MB)
  • en_core_web_md : medium model (43 MB)
  • en_core_web_lg : large model (741 MB)
  • en_core_web_trf : transformer model (438 MB)
Here en means English; other supported language codes include zh (Chinese), fr (French), de (German), ja (Japanese), and so on. The full list of languages SpaCy currently supports can be found on the official site:


The size of a model reflects how much data it contains, so lg takes up the most space. These models are distributed as Python packages (which is why they are also called model packages) and have to be installed individually as needed. There are two ways to install them; the first is to use the download command:

python -m spacy download <model_name>

Reference:


For example:

C:\Users\User>python -m spacy download en_core_web_sm   
2022-03-15 18:16:18.702879: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2022-03-15 18:16:18.704737: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     ---------------------------------------- 13.9/13.9 MB 1.4 MB/s eta 0:00:00
Requirement already satisfied: spacy<3.3.0,>=3.2.0 in c:\python37\lib\site-packages (from en-core-web-sm==3.2.0) (3.2.3)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.7.4)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.3.0)
Requirement already satisfied: thinc<8.1.0,>=8.0.12 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (8.0.13)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.0.5)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (4.54.1)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.0.1)
Requirement already satisfied: setuptools in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (51.0.0)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.7.4.3)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.8)
Requirement already satisfied: jinja2 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.10.1)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.8.2)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.6)
Requirement already satisfied: packaging>=20.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (20.7)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.3.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.5)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.8.2)
Requirement already satisfied: pathy>=0.3.5 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.6.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.5)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.4.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.21.0)
Requirement already satisfied: numpy>=1.15.0 in c:\python37\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.19.4)
Requirement already satisfied: zipp>=0.5 in c:\python37\lib\site-packages (from catalogue<2.1.0,>=2.0.6->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.1.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\python37\lib\site-packages (from packaging>=20.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.3.1)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in c:\python37\lib\site-packages (from pathy>=0.3.5->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (5.1.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2018.11.29)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python37\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.24.1)
Requirement already satisfied: click<7.2.0,>=7.1.1 in c:\python37\lib\site-packages (from typer<0.5.0,>=0.3.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (7.1.2)
Requirement already satisfied: MarkupSafe>=0.23 in c:\python37\lib\site-packages (from jinja2->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.1.1)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

The second way to install a model is with pip, but you must point it at the model's URL in SpaCy's GitHub model repository rather than just the model name, because these models are not published on PyPI.

First, find the en_core_web_sm model on the English models page of the SpaCy official site:






Clicking the "RELEASE DETAILS" link at the lower right of the model name takes you to the GitHub repository:




Scroll to the bottom of that page to find the model files in both .whl and .gz form; either one will do. Right-click the one you want and choose "Copy link address":



Then append the copied URL to the pip install command, for example installing from a whl file:

pip install https://github.com/explosion/spacy-models/releases/download/zh_core_web_trf-3.2.0/zh_core_web_trf-3.2.0-py3-none-any.whl  

With the en_core_web_sm model installed, we can start using SpaCy for natural language processing.
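
A quick sanity check along the following lines (my own minimal sketch; the component names in the comment are indicative and may differ between versions) confirms that the model package loads correctly:

import spacy

nlp = spacy.load('en_core_web_sm')     # load the freshly installed small English model
print(nlp.pipe_names)                  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
doc = nlp('SpaCy is installed correctly.')
print([token.text for token in doc])   # the sentence split into tokens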


3. The NLP Processing Pipeline:

Once SpaCy and a statistical language model are installed, the model can be loaded to analyze text. SpaCy processes the input text through a task pipeline whose structure and stages are shown below:




These five pipeline stages are carried out automatically when the text is handed to SpaCy's language object; in fact, SpaCy completes all five of them with just three statements:

import spacy  
nlp=spacy.load('en_core_web_sm')
doc=nlp('I am going to visit the White House.')

For example:

>>> import spacy    
2022-03-15 23:45:55.429882: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2022-03-15 23:45:55.430106: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
>>> nlp=spacy.load('en_core_web_sm')    
>>> type(nlp)   
<class 'spacy.lang.en.English'>   

As shown above, calling spacy.load() to load a statistical model returns a language object (conventionally named nlp), in this case an English object. Note that spacy.load() must be given the full model name; before SpaCy v3 you could pass a shortcut such as 'en' to load the smallest model automatically, but this shortcut was removed in v3.
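
As a small aside (my own sketch, not from the original notes), the full package name is required in v3; if all you need is a tokenizer without any trained components, spacy.blank() creates an empty pipeline for a given language code instead:

import spacy

nlp = spacy.load('en_core_web_sm')   # full package name is mandatory in spaCy v3
nlp_blank = spacy.blank('en')        # blank English pipeline: tokenizer only, no trained components
print(type(nlp), type(nlp_blank))    # both are spacy.lang.en.English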

Below, eval() is used to inspect the members of this language object:

>>> members=dir(nlp)    
>>> for mbr in members:    
...     obj=eval('nlp.' + mbr)    
...     if not mbr.startswith('_'):    
...         print(mbr, type(obj))    
...
Defaults <class 'type'>
add_pipe <class 'method'>
analyze_pipes <class 'method'>
batch_size <class 'int'>
begin_training <class 'method'>
component <class 'method'>
component_names <class 'spacy.util.SimpleFrozenList'>
components <class 'spacy.util.SimpleFrozenList'>
config <class 'thinc.config.Config'>
create_optimizer <class 'method'>
create_pipe <class 'method'>
create_pipe_from_source <class 'method'>
default_config <class 'thinc.config.Config'>
default_error_handler <class 'function'>
disable_pipe <class 'method'>
disable_pipes <class 'method'>
disabled <class 'spacy.util.SimpleFrozenList'>
enable_pipe <class 'method'>
evaluate <class 'method'>
factories <class 'spacy.util.SimpleFrozenDict'>
factory <class 'method'>
factory_names <class 'spacy.util.SimpleFrozenList'>
from_bytes <class 'method'>
from_config <class 'method'>
from_disk <class 'method'>
get_factory_meta <class 'method'>
get_factory_name <class 'method'>
get_pipe <class 'method'>
get_pipe_config <class 'method'>
get_pipe_meta <class 'method'>
has_factory <class 'method'>
has_pipe <class 'method'>
initialize <class 'method'>
lang <class 'str'>
make_doc <class 'method'>
max_length <class 'int'>
meta <class 'dict'>
path <class 'pathlib.WindowsPath'>
pipe <class 'method'>
pipe_factories <class 'spacy.util.SimpleFrozenDict'>
pipe_labels <class 'spacy.util.SimpleFrozenDict'>
pipe_names <class 'spacy.util.SimpleFrozenList'>
pipeline <class 'spacy.util.SimpleFrozenList'>
rehearse <class 'method'>
remove_pipe <class 'method'>
rename_pipe <class 'method'>
replace_listeners <class 'method'>
replace_pipe <class 'method'>
resume_training <class 'method'>
select_pipes <class 'method'>
set_error_handler <class 'method'>
set_factory_meta <class 'method'>
to_bytes <class 'method'>
to_disk <class 'method'>
tokenizer <class 'spacy.tokenizer.Tokenizer'>
update <class 'method'>
use_params <class 'method'>
vocab <class 'spacy.vocab.Vocab'>

In practice you rarely use these members directly; instead you pass the input text straight to the English language object for pipeline processing, and it returns a Doc object, for example:

>>> doc=nlp('I am going to visit the White House.')   
>>> type(doc)     
<class 'spacy.tokens.doc.Doc'>     

The Doc object is the result of running the text through the five pipeline stages above: the text has been split into tokens (the collective term for words, numbers, and punctuation), lemmatized, and annotated with part-of-speech and syntactic labels. Again, eval() over the member names can be used to inspect the Doc object's members:

>>> members=dir(doc)   
>>> for mbr in members:   
...     obj=eval('doc.' + mbr)      
...     if not mbr.startswith('_'):      
...         print(mbr, type(obj))       
...
cats <class 'dict'>
char_span <class 'builtin_function_or_method'>
copy <class 'builtin_function_or_method'>
count_by <class 'builtin_function_or_method'>
doc <class 'spacy.tokens.doc.Doc'>
ents <class 'tuple'>
extend_tensor <class 'builtin_function_or_method'>
from_array <class 'builtin_function_or_method'>
from_bytes <class 'builtin_function_or_method'>
from_dict <class 'builtin_function_or_method'>
from_disk <class 'builtin_function_or_method'>
from_docs <class 'builtin_function_or_method'>
get_extension <class 'builtin_function_or_method'>
get_lca_matrix <class 'builtin_function_or_method'>
has_annotation <class 'builtin_function_or_method'>
has_extension <class 'builtin_function_or_method'>
has_unknown_spaces <class 'bool'>
has_vector <class 'bool'>
__main__:1: DeprecationWarning: [W107] The property `Doc.is_nered` is deprecated. Use `Doc.has_annotation("ENT_IOB")` instead.
is_nered <class 'bool'>
__main__:1: DeprecationWarning: [W107] The property `Doc.is_parsed` is deprecated. Use `Doc.has_annotation("DEP")` instead.
is_parsed <class 'bool'>
__main__:1: DeprecationWarning: [W107] The property `Doc.is_sentenced` is deprecated. Use `Doc.has_annotation("SENT_START")` instead.
is_sentenced <class 'bool'>
__main__:1: DeprecationWarning: [W107] The property `Doc.is_tagged` is deprecated. Use `Doc.has_annotation("TAG")` instead.
is_tagged <class 'bool'>
lang <class 'int'>
lang_ <class 'str'>
mem <class 'cymem.cymem.Pool'>
noun_chunks <class 'generator'>
noun_chunks_iterator <class 'function'>
remove_extension <class 'builtin_function_or_method'>
retokenize <class 'builtin_function_or_method'>
sentiment <class 'float'>
sents <class 'generator'>
set_ents <class 'builtin_function_or_method'>
set_extension <class 'builtin_function_or_method'>
similarity <class 'builtin_function_or_method'>
spans <class 'spacy.tokens._dict_proxies.SpanGroups'>
tensor <class 'numpy.ndarray'>
text <class 'str'>
text_with_ws <class 'str'>
to_array <class 'builtin_function_or_method'>
to_bytes <class 'builtin_function_or_method'>
to_dict <class 'builtin_function_or_method'>
to_disk <class 'builtin_function_or_method'>
to_json <class 'builtin_function_or_method'>
to_utf8_array <class 'builtin_function_or_method'>
user_data <class 'dict'>
user_hooks <class 'dict'>
user_span_hooks <class 'dict'>
user_token_hooks <class 'dict'>
vector <class 'numpy.ndarray'>
vector_norm <class 'float'>
vocab <class 'spacy.vocab.Vocab'>

SpaCy stores the results of the five pipeline stages in the Doc object, which is actually made up of the Token objects produced by tokenization and can be inspected with a loop:

>>> for token in doc:    
    print(token.text, type(token))       
    
I <class 'spacy.tokens.token.Token'>
am <class 'spacy.tokens.token.Token'>
going <class 'spacy.tokens.token.Token'>
to <class 'spacy.tokens.token.Token'>
visit <class 'spacy.tokens.token.Token'>
the <class 'spacy.tokens.token.Token'>
White <class 'spacy.tokens.token.Token'>
House <class 'spacy.tokens.token.Token'>
. <class 'spacy.tokens.token.Token'>

Since a Doc object behaves like a list of Token objects, individual tokens can be accessed by index with the [] operator, for example:

>>> doc[0]   
I
>>> type(doc[0])     
<class 'spacy.tokens.token.Token'>
>>> doc[1]   
am
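
Incidentally, while indexing with a single integer returns a Token, slicing a Doc with a range returns a Span object (a view over consecutive tokens). A small sketch of my own, continuing with the same doc:

span = doc[6:8]       # the slice covering 'White House'
print(type(span))     # <class 'spacy.tokens.span.Span'>
print(span.text)      # White House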

The members of a Token object can likewise be listed with eval(); the code is as follows:

import spacy
nlp=spacy.load('en_core_web_sm')
doc=nlp('I am going to visit the White House.')
members=dir(doc[0])                  # all attribute names of the first Token (doc[0])
for mbr in members:
    obj=eval('doc[0].' + mbr)        # look up each attribute's value by name
    if not mbr.startswith('_'):      # skip private/dunder members
        print(mbr, type(obj))

The output is:

ancestors <class 'generator'>
check_flag <class 'builtin_function_or_method'>
children <class 'generator'>
cluster <class 'int'>
conjuncts <class 'tuple'>
dep <class 'int'>
dep_ <class 'str'>
doc <class 'spacy.tokens.doc.Doc'>
ent_id <class 'int'>
ent_id_ <class 'str'>
ent_iob <class 'int'>
ent_iob_ <class 'str'>
ent_kb_id <class 'int'>
ent_kb_id_ <class 'str'>
ent_type <class 'int'>
ent_type_ <class 'str'>
get_extension <class 'builtin_function_or_method'>
has_dep <class 'builtin_function_or_method'>
has_extension <class 'builtin_function_or_method'>
has_head <class 'builtin_function_or_method'>
has_morph <class 'builtin_function_or_method'>
has_vector <class 'bool'>
head <class 'spacy.tokens.token.Token'>
i <class 'int'>
idx <class 'int'>
iob_strings <class 'builtin_function_or_method'>
is_alpha <class 'bool'>
is_ancestor <class 'builtin_function_or_method'>
is_ascii <class 'bool'>
is_bracket <class 'bool'>
is_currency <class 'bool'>
is_digit <class 'bool'>
is_left_punct <class 'bool'>
is_lower <class 'bool'>
is_oov <class 'bool'>
is_punct <class 'bool'>
is_quote <class 'bool'>
is_right_punct <class 'bool'>
is_sent_end <class 'bool'>
is_sent_start <class 'bool'>
is_space <class 'bool'>
is_stop <class 'bool'>
is_title <class 'bool'>
is_upper <class 'bool'>
lang <class 'int'>
lang_ <class 'str'>
left_edge <class 'spacy.tokens.token.Token'>
lefts <class 'generator'>
lemma <class 'int'>
lemma_ <class 'str'>
lex <class 'spacy.lexeme.Lexeme'>
lex_id <class 'int'>
like_email <class 'bool'>
like_num <class 'bool'>
like_url <class 'bool'>
lower <class 'int'>
lower_ <class 'str'>
morph <class 'spacy.tokens.morphanalysis.MorphAnalysis'>
n_lefts <class 'int'>
n_rights <class 'int'>
nbor <class 'builtin_function_or_method'>
norm <class 'int'>
norm_ <class 'str'>
orth <class 'int'>
orth_ <class 'str'>
pos <class 'int'>
pos_ <class 'str'>
prefix <class 'int'>
prefix_ <class 'str'>
prob <class 'float'>
rank <class 'int'>
remove_extension <class 'builtin_function_or_method'>
right_edge <class 'spacy.tokens.token.Token'>
rights <class 'generator'>
sent <class 'spacy.tokens.span.Span'>
sent_start <class 'bool'>
sentiment <class 'float'>
set_extension <class 'builtin_function_or_method'>
set_morph <class 'builtin_function_or_method'>
shape <class 'int'>
shape_ <class 'str'>
similarity <class 'builtin_function_or_method'>
subtree <class 'generator'>
suffix <class 'int'>
suffix_ <class 'str'>
tag <class 'int'>
tag_ <class 'str'>
tensor <class 'numpy.ndarray'>
text <class 'str'>
text_with_ws <class 'str'>
vector <class 'numpy.ndarray'>
vector_norm <class 'numpy.float32'>
vocab <class 'spacy.vocab.Vocab'>
whitespace_ <class 'str'>

The most commonly used Token attributes are summarized in the table below:


 Common Token attribute    Description
 text                      the token's text as it appears in the input
 lemma_                    the lemma (base form) of the token
 tag_                      fine-grained grammatical tag
 pos_                      coarse-grained part of speech (POS)
 dep_                      syntactic dependency label
 head                      the parent Token it depends on (its syntactic head)


The following program prints the text, lemma_, tag_, pos_, dep_, and head attributes of every Token object in the Doc object:


Test 1: Print the main attribute values of every Token object in the Doc object (a list-like container) [view source]

import spacy
nlp=spacy.load('en_core_web_sm')
doc=nlp('I am going to visit the White House.')
print(f'text\tlemma_\ttag_\tpos_\tdep_\thead')  
for token in doc:
    fstr=(f'{token.text}\t{token.lemma_}\t{token.tag_}\t{token.pos_}\t'
          f'{token.dep_}\t{token.head}')
    print(fstr)

Because the formatted f-string here is too long, it is split into two f-strings wrapped in a pair of parentheses so that Python concatenates them; see reply No. 17 in the article below:


The output is as follows:




Here lemma_ is the base form of text, tag_ is the fine-grained grammatical tag (e.g. VBG for a verb in the present participle), pos_ is the coarse part of speech (verb, particle, pronoun, and so on), while dep_ and head describe the syntactic dependency relations.

The dep_ attribute records a token's syntactic dependency label. In SpaCy's dependency parse tree, the main verb of the predicate has its dep_ marked as ROOT, meaning it is the root (parent) node of the tree; every other token is a child node and is assigned a dependency label, stored in its own dep_ attribute. In the example above, going is the main verb, so its dep_ is ROOT, while I has dep_ equal to nsubj, meaning it is the nominal subject of the ROOT. All of this comes from SpaCy's dependency parsing and helps with interpreting the meaning of a sentence.
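
To make the dependency structure easier to inspect, a short sketch like the following (my own addition) finds the ROOT token and lists its immediate children with their dependency labels; SpaCy's built-in displacy visualizer can also render the tree graphically:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('I am going to visit the White House.')

root = [token for token in doc if token.dep_ == 'ROOT'][0]   # the main verb of the sentence
print('ROOT:', root.text)
for child in root.children:                                  # direct dependents of the root
    print(f'{child.text:<8}{child.dep_:<10}head={child.head.text}')

# Optional: open a browser-based visualization of the dependency tree.
# from spacy import displacy
# displacy.serve(doc, style='dep')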
