小狐狸事務所: NLTK Study Notes (3): The nltk.book.text1~9 Corpora

Sunday, November 14, 2021

NLTK Study Notes (3): The nltk.book.text1~9 Corpora

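I have not had time to study NLTK since installing it. Although I would rather learn the newer SpaCy, NLTK is a classic natural language processing toolkit, so it is still worth exploring, and it will give me something to compare against when I do learn SpaCy. This post examines the nine texts text1~text9 in nltk.book; it consists of my test notes from reading the first chapter of the following book: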

# Natural Language Processing with Python (O'Reilly, 2009)

Earlier posts in this series:

# NLTK Study Notes (1): Installing the NLTK Package and Corpora

# NLTK Study Notes (2): Installing the NLTK Corpora Offline

1. Attributes and methods of the Text object:

The nltk.book module bundles nine texts, text1~text9, all of which are Text objects. Start by importing everything from nltk.book; it immediately prints the titles of text1~text9:

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***

Loading text1, ..., text9 and sent1, ..., sent9

Type the name of the text or sentence to view it.

Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

The built-in type() function confirms that text1~text9 are all Text objects:

>>> type(text1) # text1~text9 are all Text objects

<class 'nltk.text.Text'>

Passing any of text1~text9 to dir() returns the list of the Text object's members; taking text1 as an example:

>>> dir(text1) # returns the list of members of the Text object

['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_context', '_train_default_ngram_lm', 'collocation_list', 'collocations', 'common_contexts', 'concordance', 'concordance_list', 'count', 'dispersion_plot', 'findall', 'generate', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'vocab']

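The member names alone do not tell us which are attributes and which are methods. To find out, assign the return value of dir() to a variable, then loop over it and print() the type of each member, for example: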

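>>> members=dir(text1)

>>> type(members)

<class 'list'>

>>> for mbr in members: # iterate over the members of the Text object
    obj=eval('text1.' + mbr) # eval() evaluates 'text1.<member>' to get a reference to the member
    if not mbr.startswith('_'): # keep only members that do not start with "_"
        print(mbr, type(obj))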

collocation_list <class 'method'>

collocations <class 'method'>

common_contexts <class 'method'>

concordance <class 'method'>

concordance_list <class 'method'>

count <class 'method'>

dispersion_plot <class 'method'>

findall <class 'method'>

generate <class 'method'>

index <class 'method'>

name <class 'str'>

plot <class 'method'>

readability <class 'method'>

similar <class 'method'>

tokens <class 'list'>

vocab <class 'method'>

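Inside the loop, eval() evaluates the string 'text1.' plus the member name to obtain a reference to that member. This technique turns an object name held in a string into a reference to the object, which is handy when a loop needs to access a group of similarly named objects. Besides eval(), vars()[text] or locals()[text] can also look up a variable by name (where text is the string holding the name); see:

# Python: How to use a string as a variable name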

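Next, the string method startswith() filters out every member whose name begins with '_', leaving the public attributes and methods. In the output, members whose class is 'method' are methods and the rest are attributes. The result shows that the text1~text9 objects have only two attributes, name and tokens; all other public members are methods. A for loop can print the titles of all nine texts: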

>>> for i in range(1, 10):
    text='text' + str(i) # build the variable name 'text1'~'text9' as a string
    print(f'{text} name={eval(text).name}') # eval() resolves the name to the object; print its name attribute

text1 name=Moby Dick by Herman Melville 1851

text2 name=Sense and Sensibility by Jane Austen 1811

text3 name=The Book of Genesis

text4 name=Inaugural Address Corpus

text5 name=Chat Corpus

text6 name=Monty Python and the Holy Grail

text7 name=Wall Street Journal

text8 name=Personals Corpus

text9 name=The Man Who Was Thursday by G . K . Chesterton 1908
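The member survey above can also be written without eval(), using the built-in getattr() and callable(). The following is only a minimal sketch of that alternative, not the method used in this post: MiniText is a hypothetical stand-in class (so the snippet runs without the NLTK data), and it models just name, tokens and count() after the Text members listed above.

```python
class MiniText:
    """Hypothetical stand-in for nltk.text.Text (an assumption, not the real class)."""

    def __init__(self, tokens, name):
        self.tokens = list(tokens)   # data attribute, like Text.tokens
        self.name = name             # data attribute, like Text.name

    def count(self, word):
        """Method: number of times `word` occurs in the tokens."""
        return self.tokens.count(word)


def public_members(obj):
    """Split the public members of obj into (attributes, methods)."""
    attributes, methods = [], []
    for member in dir(obj):
        if member.startswith('_'):       # skip private and dunder members
            continue
        value = getattr(obj, member)     # look the member up by name, no eval()
        if callable(value):
            methods.append(member)
        else:
            attributes.append(member)
    return attributes, methods


demo = MiniText(['In', 'the', 'beginning', 'the'], 'demo text')
attributes, methods = public_members(demo)
print('attributes:', attributes)
print('methods   :', methods)
```

With NLTK loaded, calling public_members(text1) would presumably report name and tokens as the attributes, in line with the output shown above.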

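2. Counting tokens:

In natural language processing, a token is a sequence of characters treated as a single unit. It can be an English word, a punctuation mark, a number or a special symbol, so it is not limited to words. This section checks the number of tokens in each of the nine texts text1~text9 in nltk.book.

Passing each of the variables text1~text9 to len() gives the token count of that text: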
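To make the idea of a token concrete, here is a rough sketch that splits a sentence into tokens with a regular expression. The pattern and the simple_tokenize() helper are deliberately simple assumptions for illustration; this is not NLTK's tokenizer, and the sentence is just sample input.

```python
import re


def simple_tokenize(text):
    """Very rough tokenizer: words/numbers and single punctuation marks."""
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Pierre Vinken, 61 years old, will join the board."
tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens))
```

Note that the commas, the number and the final period each count as a token of their own, which is why a token count is larger than a word count.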

>>> len(text1)

260819

>>> len(text2)

141576

>>> len(text3)

44764

>>> len(text4)

149797

>>> len(text5)

45010

>>> len(text6)

16967

>>> len(text7)

100676

>>> len(text8)

4867

>>> len(text9)

69213

Calling len() one text at a time is tedious; a for loop combined with eval() does the same job:

>>> for i in range(1, 10):
    text='text' + str(i)
    length=len(eval(text)) # eval() resolves the name string to the variable
    print(f'total tokens of {text} : {length}')

total tokens of text1 : 260819

total tokens of text2 : 141576

total tokens of text3 : 44764

total tokens of text4 : 149797

total tokens of text5 : 45010

total tokens of text6 : 16967

total tokens of text7 : 100676

total tokens of text8 : 4867

total tokens of text9 : 69213

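Note that eval() must be applied to the string first to obtain a reference to the variables text1~text9. Calling len(text) directly would measure the strings 'text1', 'text2', ... themselves and would therefore always return 5 (because 'text1' has 5 characters), not the length of the contents of text1~text9.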
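The pitfall is easy to reproduce without NLTK. In the sketch below, the two token lists are made-up stand-ins for the real texts (assumptions for illustration only), and the objects are kept in a dictionary keyed by name, which is one common way to loop over a group of named objects without eval().

```python
# Made-up stand-ins for two texts (not the real NLTK data)
corpora = {
    'text1': ['Call', 'me', 'Ishmael', '.'],
    'text2': ['The', 'family', 'of', 'Dashwood', 'had', 'long'],
}

report = {}
for name, tokens in corpora.items():
    # len(name) measures the string 'text1'; len(tokens) measures the object it names
    report[name] = (len(name), len(tokens))
    print(f'{name}: len(name)={len(name)} len(object)={len(tokens)}')
```

The length of the name is the same for every entry, while the length of the object varies with its contents.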

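As the member survey above showed, the tokens attribute of text1~text9 is a list that records every token in the text. With so many elements, printing the whole list is impractical, but a loop can display the first 10 elements of each, for example:

>>> type(text1.tokens) # the tokens attribute is a list

<class 'list'>

>>> for i in range(1, 9): # range(1, 9) covers text1~text8; use range(1, 10) to include text9
    text=eval('text' + str(i)) # eval() resolves the name string to the object
    print(text.tokens[:10]) # show the first 10 elements of the tokens attribute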

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.']

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER']

['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth']

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House']

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ':P', 'PART', 'hey']

['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop']

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the']

['25', 'SEXY', 'MALE', ',', 'seeks', 'attrac', 'older', 'single', 'lady', ',']

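Clearly, tokens are not only English words; they also include numbers, punctuation marks and so on. Calling len() on the tokens attribute of text1~text9 gives the same token counts as len(text1)~len(text9) above, for example: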

>>> len(text1.tokens)

260819

>>> len(text2.tokens)

141576

>>> len(text3.tokens)

44764

>>> len(text4.tokens)

149797

>>> len(text5.tokens)

45010

>>> len(text6.tokens)

16967

>>> len(text7.tokens)

100676

>>> len(text8.tokens)

4867

>>> len(text9.tokens)

69213

A loop works here as well:

>>> for i in range(1, 10):
    text='text' + str(i)
    length=len(eval(text).tokens) # eval() resolves the name string to the variable
    print(f'total tokens of {text} : {length}')

total tokens of text1 : 260819

total tokens of text2 : 141576

total tokens of text3 : 44764

total tokens of text4 : 149797

total tokens of text5 : 45010

total tokens of text6 : 16967

total tokens of text7 : 100676

total tokens of text8 : 4867

total tokens of text9 : 69213

The results match those obtained above with len(text1) ~ len(text9).
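The two counts agree because len(obj) calls the object's __len__ method, and a Text-like class can simply delegate that call to its token list. The sketch below uses a hypothetical TokenText class to show the mechanism; it illustrates the idea and is not NLTK's source code.

```python
class TokenText:
    """Hypothetical Text-like wrapper around a list of tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def __len__(self):
        # len(obj) is delegated to the underlying token list
        return len(self.tokens)

    def __getitem__(self, index):
        # indexing and slicing are delegated to the token list as well
        return self.tokens[index]


sample = TokenText(['SCENE', '1', ':', '[', 'wind', ']'])
print(len(sample), len(sample.tokens))
print(sample[:3])
```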

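3. Removing duplicate tokens with set():

The count returned by len() includes repeated tokens. To eliminate duplicates, use Python's set type, whose elements are always unique. Passing any of text1~text9 to set() drops the repeated tokens, and passing the resulting set to len() gives the number of distinct tokens in the text, for example: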

>>> text1_set=set(text1)

>>> type(text1_set)

<class 'set'>

>>> len(text1_set)

19317

>>> len(text1)

260819

text1 has more than 260,000 tokens in total, but many of them occur repeatedly; after set() removes the duplicates, only a little over 19,000 distinct tokens remain. A for loop computes the unique token count for every text:

>>> for i in range(1, 10):
    text='text' + str(i)
    tokens=len(eval(text))
    unique_tokens=len(set(eval(text))) # pass the text to set() first to remove duplicates
    print(f'{text} : total tokens={tokens} unique tokens={unique_tokens}')

text1 : total tokens=260819 unique tokens=19317

text2 : total tokens=141576 unique tokens=6833

text3 : total tokens=44764 unique tokens=2789

text4 : total tokens=149797 unique tokens=9913

text5 : total tokens=45010 unique tokens=6066

text6 : total tokens=16967 unique tokens=2166

text7 : total tokens=100676 unique tokens=12408

text8 : total tokens=4867 unique tokens=1108

text9 : total tokens=69213 unique tokens=6807

The number of distinct tokens is far smaller than the total.
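The same counting can be tried on a tiny hand-made token list. In this sketch the list is invented for illustration; it also shows that set() is case-sensitive, so 'The' and 'the' count as two distinct tokens unless the tokens are lowercased first (an optional variation, not something done in this post).

```python
# Invented sample tokens (not taken from the NLTK texts)
tokens = ['the', 'cat', 'and', 'The', 'dog', 'and', 'the', 'bird', '.']

unique_tokens = set(tokens)                   # case-sensitive: 'The' != 'the'
unique_lower = {t.lower() for t in tokens}    # case-insensitive variant

print(f'total={len(tokens)} unique={len(unique_tokens)} unique_lower={len(unique_lower)}')
print(sorted(unique_tokens))
```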
