小狐狸事務所: NLP Study Notes: Tokenization and Corpus

Thursday, February 18, 2021


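This year I want to start learning natural language processing with Python. Recently, in Chapter 2 of Koki Saitoh's "Deep Learning 2 : 用 Python 進行自然語言處理的基礎理論實作" (碁峰 GOTOP, 2019), I read about building a basic corpus from Python lists and string objects. It looked simple and practical, so I tried it out right away.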


The book is a translation of the Japanese edition, "Deep Learning From Scratch 2" (O'Reilly Japan, 2018).


The book's example code can be downloaded from GitHub:

# https://github.com/oreilly-japan/deep-learning-from-scratch-2

GOTOP (碁峰) also offers a download of the code with Chinese comments:

# http://books.gotop.com.tw/download/A581

For the Python string methods used below, see:

# Python Study Notes: String Functions and String Object Methods

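1. Tokenization:

To build a basic corpus, the text must first be tokenized. Calling the string object's split() method to split on spaces is enough, but punctuation has to be handled with replace() before splitting; otherwise each punctuation mark stays attached to the end of the preceding word (a punctuation mark is also a token). The book's example is simple, just a single sentence: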

>>> text1="You say goodbye and I say hello."

First, convert the whole string to lowercase with the lower() method:

>>> text1=text1.lower() # convert everything to lowercase

>>> text1

'you say goodbye and i say hello.'

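Punctuation must be dealt with before calling split(); otherwise, when splitting on spaces, it sticks to the preceding word. In the sentence above, the final period stays attached to hello:

>>> words1=text1.split(" ") # split into words on spaces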

>>> words1

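['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello.'] # the period is stuck to hello

The fix is to use replace() to add a space before and/or after each punctuation mark. Here the period directly follows a word, so adding one space before the period is enough: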

>>> text1="You say goodbye and I say hello."

>>> text1=text1.lower() # convert to lowercase

>>> text1=text1.replace(".", " .") # add a space before the period

>>> text1

'you say goodbye and i say hello .'

>>> words1=text1.split(" ") # split into words on spaces

>>> words1

['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

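After adding the space, the period is split off as its own token. This example is simple, though: a single sentence with a single period. Sentences usually contain at least commas and a period, for example the famous line from Shakespeare's "Hamlet":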

>>> text2="To be, or not to be, that is the question."

>>> text2=text2.lower()

>>> text2

'to be, or not to be, that is the question.'

>>> text2=text2.replace(",", " ,") # add a space before each comma

>>> text2=text2.replace(".", " .") # add a space before the period

>>> text2

'to be , or not to be , that is the question .'

>>> words2=text2.split(" ") # split into words on spaces

>>> words2

['to', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.']

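The words and punctuation marks are all separated correctly. Any other punctuation marks (such as ! or ?) must likewise be handled with replace() before splitting. One problem remains: in both text1 and text2, each punctuation mark directly follows a word and is then followed by a space before the next word. That is the standard way to write a sentence, but if that space is missing, the approach above breaks, for example:

>>> text3="To be,or not to be,that is the question." # no space after the commas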

>>> text3=text3.lower()

>>> text3

'to be,or not to be,that is the question.'

>>> text3=text3.replace(",", " ,")

>>> text3=text3.replace(".", " .")

>>> text3

'to be ,or not to be ,that is the question .'

>>> words3=text3.split(" ")

>>> words3

['to', 'be', ',or', 'not', 'to', 'be', ',that', 'is', 'the', 'question', '.']

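Because the two commas are not followed by a space, each one sticks to the next word after splitting, giving ',or' and ',that'. One fix is to add a space both before and after each punctuation mark when calling replace(), for example:

>>> text3="To be,or not to be,that is the question." # no space after the commas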

>>> text3=text3.lower()

>>> text3

'to be,or not to be,that is the question.'

>>> text3=text3.replace(",", " , ") # add a space before and after each comma

>>> text3=text3.replace(".", " . ") # add a space before and after the period

>>> text3

'to be , or not to be , that is the question . '

>>> words3=text3.split(" ")

>>> words3

['to', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.', '']

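The punctuation marks are now separated, but the resulting list ends with an extra empty string. It can be removed with a loop that checks for empty strings. The loop iterates over a copy (words3[:]) because removing items from the list being iterated can skip elements, for example when two empty strings are adjacent:

>>> for w in words3[:]: # iterate over a copy of the list
...     if w=='':
...         words3.remove(w) # remove the empty-string element
...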

>>> words3

['to', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.']

Reference:

# Remove an item from a list in Python (clear, pop, remove, del)
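As a minimal sketch of the steps above, the padding and clean-up can be wrapped in one helper. The function name `tokenize` and the `PUNCTUATION` string are assumptions made for illustration, not names from the book. The sketch relies on the fact that `split()` called without arguments splits on runs of whitespace and never returns empty strings, so no clean-up loop is needed:

```python
# Hypothetical helper (not from the book): pad punctuation, then split.
PUNCTUATION = ".,!?;:"  # assumed set of marks to treat as separate tokens

def tokenize(text):
    text = text.lower()
    for mark in PUNCTUATION:
        # Surround each mark with spaces so it becomes its own token.
        text = text.replace(mark, " " + mark + " ")
    # split() with no argument splits on any run of whitespace
    # and drops empty strings, unlike split(" ").
    return text.split()

words = tokenize("To be,or not to be, that is the question.")
print(words)
```

For this input the result is the same 13-token list as above, whether or not a space follows each comma.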

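2. Corpus pre-processing:

Tokenizing the sentence above produced a list, but its elements are still text, so searching can only be done by string comparison, which is inconvenient. Each word should be given an index (ID), with a two-way mapping between IDs and words stored in two dict objects, word_to_id and id_to_word. This is the pre-processing step for building the corpus: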

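>>> words3 # the tokenization result from above

['to', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.']

>>> word_to_id={} # empty dict for the word-to-ID mapping

>>> id_to_word={} # empty dict for the ID-to-word mapping

>>> for w in words3: # iterate over the words in the list
...     if w not in word_to_id: # if the word is not yet a key of word_to_id
...         new_id=len(word_to_id) # use the dict's current length as the new ID
...         word_to_id[w]=new_id # add the word-to-ID mapping
...         id_to_word[new_id]=w # add the ID-to-word mapping
...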

>>> id_to_word

{0: 'to', 1: 'be', 2: ',', 3: 'or', 4: 'not', 5: 'that', 6: 'is', 7: 'the', 8: 'question', 9: '.'}

>>> word_to_id

{'to': 0, 'be': 1, ',': 2, 'or': 3, 'not': 4, 'that': 5, 'is': 6, 'the': 7, 'question': 8, '.': 9}

Each distinct word is stored in the dictionaries only once; they simply record which words occur in the text and the ID of each.
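The same two-way mapping can also be built more compactly. The following is only an alternative sketch, not the book's code: it uses `dict.setdefault()` to assign the next ID the first time a word is seen, then inverts the dictionary with a dict comprehension. The token list is hard-coded from the result above.

```python
words = ['to', 'be', ',', 'or', 'not', 'to', 'be', ',',
         'that', 'is', 'the', 'question', '.']

word_to_id = {}
for w in words:
    # len(word_to_id) is evaluated first; setdefault() stores it
    # only if w is not already a key.
    word_to_id.setdefault(w, len(word_to_id))

# Invert the mapping: ID -> word.
id_to_word = {i: w for w, i in word_to_id.items()}

print(word_to_id['question'])  # look up a word's ID
print(id_to_word[5])           # look up the word for an ID
print(len(word_to_id))         # number of distinct tokens
```

The lookups print 8, that, and 10, which agrees with the dictionaries shown above.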

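3. Building the corpus:

With the word_to_id dictionary in hand, the corpus can be built. Concretely, the corpus here is the list of dictionary IDs of the words in the sentence, in order. It takes a single list comprehension: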

>>> corpus=[word_to_id[w] for w in words3]

>>> corpus

[0, 1, 2, 3, 4, 0, 1, 2, 5, 6, 7, 8, 9]

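This corpus is the list of dictionary IDs of every token (including punctuation) in the sentence "To be, or not to be, that is the question." Because natural language processing operates on text as numeric vectors, the list is converted to a NumPy ndarray (a vector) for storage: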

>>> import numpy as np

>>> corpus=np.array(corpus)

>>> corpus

array([0, 1, 2, 3, 4, 0, 1, 2, 5, 6, 7, 8, 9])
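Once the corpus is an ndarray, it can be mapped back to text and summarized with NumPy functions. The sketch below hard-codes the `corpus` and `id_to_word` values shown above (so it runs on its own), decodes the IDs back into tokens, and counts how often each ID occurs with `np.bincount`:

```python
import numpy as np

corpus = np.array([0, 1, 2, 3, 4, 0, 1, 2, 5, 6, 7, 8, 9])
id_to_word = {0: 'to', 1: 'be', 2: ',', 3: 'or', 4: 'not',
              5: 'that', 6: 'is', 7: 'the', 8: 'question', 9: '.'}

# Decode the ID sequence back into tokens.
decoded = [id_to_word[int(i)] for i in corpus]
print(" ".join(decoded))

# counts[i] is the number of times word ID i appears in the corpus.
counts = np.bincount(corpus)
for word_id, n in enumerate(counts):
    print(id_to_word[word_id], int(n))
```

Here 'to', 'be' and ',' each occur twice and every other token once.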

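The steps above can be written as a preprocess() function that takes a sentence string and, after pre-processing, returns the corpus vector corpus together with the two dictionaries word_to_id and id_to_word. The test program is as follows: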

# corpus.py

import numpy as np

def preprocess(text):
    text=text.lower()                  # convert to lowercase
    text=text.replace(",", " , ")      # pad commas with spaces
    text=text.replace(".", " . ")      # pad periods with spaces
    words=text.split(" ")              # split on single spaces
    for w in words[:]:                 # iterate over a copy so removal is safe
        if w=='':
            words.remove(w)            # drop empty strings left by the padding
    word_to_id={}                      # word -> ID
    id_to_word={}                      # ID -> word
    for w in words:
        if w not in word_to_id:
            new_id=len(word_to_id)     # next unused ID
            word_to_id[w]=new_id
            id_to_word[new_id]=w
    corpus=np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

text="To be, or not to be, that is the question."
corpus, word_to_id, id_to_word=preprocess(text)
print(corpus)
print(word_to_id)
print(id_to_word)

The output is as follows:

D:\Python\test>python corpus.py

[0 1 2 3 4 0 1 2 5 6 7 8 9]

{'to': 0, 'be': 1, ',': 2, 'or': 3, 'not': 4, 'that': 5, 'is': 6, 'the': 7, 'question': 8, '.': 9}

{0: 'to', 1: 'be', 2: ',', 3: 'or', 4: 'not', 5: 'that', 6: 'is', 7: 'the', 8: 'question', 9: '.'}

This matches the results of the interactive session above.
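The preprocess() function only handles commas and periods. As a hedged sketch of one way to generalize it, the variant below swaps the replace()/split() steps for a regular expression. The name `preprocess_re` and the pattern are assumptions for illustration, not the book's method. A known limitation is that `\w+` does not match apostrophes, so a word like "don't" would be split into three tokens.

```python
import re
import numpy as np

def preprocess_re(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # character that is neither a word character nor whitespace,
    # so every punctuation mark becomes its own token.
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    word_to_id = {}
    id_to_word = {}
    for w in words:
        if w not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[w] = new_id
            id_to_word[new_id] = w
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess_re(
    "To be,or not to be!  That is the question?")
print(corpus)
print(id_to_word)
```

Because findall() returns only the matched tokens, missing or repeated spaces never produce empty strings, and no clean-up loop is required.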
