小狐狸事務所: Python Study Notes: Regular Expressions (Worked Examples)

Thursday, May 26, 2022

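With a basic grasp of how regular expressions work and how to use them, the next step is to try them on more everyday examples and see whether that understanding really holds up. The earlier article in this series:

# Python Study Notes: Regular Expressions (Basics)

Books consulted for this article:

精通正規表達式 (歐萊里, 2012)
處理大數據的必備美工刀 :全支援中文的正規表示法精解 (上奇, 2016)
增壓的 Python : 讓程式碼進化到全新境界 (碁峰, 2020), chapters 6 and 7
Python 自動化的樂趣 (碁峰, 2016), chapter 7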

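1. Matching an integer between 0 and 255:

Checking whether an integer lies in the range 0-255 comes up often. Network addresses and RGB color codes, for example, are built from one-byte units, so each value is an integer from 0 to 255. The regular expression is:

pattern=r'''^([0-9]| # one digit: 0-9
[0-9]{2}| # two digits: each 0-9
1[0-9][0-9]| # three digits, hundreds digit 1: tens and ones 0-9
2[0-4][0-9]| # three digits, hundreds digit 2, tens digit 0-4: ones 0-9
25[0-5])$''' # three digits, hundreds digit 2, tens digit 5: ones 0-5

Note that because the expression consists of five alternatives, they must be wrapped in parentheses as one group. Alternation (|) has the lowest precedence, so without the group ^ would bind only to the first alternative and $ only to the last. re.match() on '255' would then succeed on the first alternative [0-9], match just the 2, and stop. The multi-line form above, with its whitespace and # comments, is written for readability; to compile it exactly as written, pass the re.VERBOSE flag.

For example: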

>>> import re

>>> pattern=r'^([0-9]|[0-9]{2}|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$'

>>> re.match(pattern, '9')

<re.Match object; span=(0, 1), match='9'>

>>> re.match(pattern, '99')

<re.Match object; span=(0, 2), match='99'>

>>> re.match(pattern, '199')

<re.Match object; span=(0, 3), match='199'>

>>> re.match(pattern, '249')

<re.Match object; span=(0, 3), match='249'>

>>> re.match(pattern, '255')

<re.Match object; span=(0, 3), match='255'>

All of these integers in the range 0-255 match as expected.
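The examples above only show successful matches. The following is a minimal sketch, using names of my own choosing (BYTE_RE, UNGROUPED_RE, and is_byte are illustrative, not from the books), that compiles the commented multi-line form with re.VERBOSE, checks a few out-of-range values, and shows what happens when the group is left out:

```python
import re

# Verbose form: whitespace and # comments inside the pattern are ignored.
BYTE_RE = re.compile(r'''
    ^(
        [0-9]        |  # one digit
        [0-9]{2}     |  # two digits (a leading zero such as '07' is accepted)
        1[0-9][0-9]  |  # 100-199
        2[0-4][0-9]  |  # 200-249
        25[0-5]         # 250-255
    )$
''', re.VERBOSE)

# Same alternatives without the group: ^ and $ each bind to one alternative only.
UNGROUPED_RE = re.compile(r'^[0-9]|[0-9]{2}|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$')


def is_byte(text):
    """Return True if text is matched by the grouped 0-255 pattern."""
    return BYTE_RE.match(text) is not None


print(is_byte('255'), is_byte('256'), is_byte('1000'))  # True False False
print(UNGROUPED_RE.match('256').group())                # 2
```

The ungrouped pattern "matches" '256' by taking only its first digit, which is the failure mode the grouping prevents.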

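2. Matching a Taiwan national ID number:

The format of a Taiwan national ID number is:

The first character is an uppercase English letter
It is followed by 9 digit characters
The second character (i.e. the first digit) is 1 for male, 2 for female
From the third character on there are 8 digits 0-9, e.g. S123456789

The regular expression is:

pattern=r'''^[A-Z] # uppercase English letter
[12] # first digit: 1 = male, 2 = female
[0-9]{8}$''' # the remaining eight digits, each 0-9

For example: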

>>> import re

>>> pattern=r'^[A-Z][12][0-9]{8}$'

>>> re.search(pattern, 'S123456789') # match

<re.Match object; span=(0, 10), match='S123456789'>

>>> re.search(pattern, 'S223456789') # match

<re.Match object; span=(0, 10), match='S223456789'>

>>> re.findall(pattern, 'S323456789') # second character must be 1 or 2

[]

>>> re.findall(pattern, 'a123456789') # first character must be uppercase

[]

>>> re.findall(pattern, 'A12345678') # only 8 digits after the letter (9 required)

[]

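The first two target strings fit the pattern and match. The third fails because its second character is 3, and the last two fail because of the lowercase first letter and the missing digit, respectively.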
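As a small illustration of putting this pattern to work, here is a sketch that wraps it in a function and reads the gender digit through a named group. The names ID_RE and parse_id are hypothetical, and the function checks the format only, exactly as the pattern above does:

```python
import re

# Same format as above; the gender digit is captured in a named group.
ID_RE = re.compile(r'[A-Z](?P<sex>[12])[0-9]{8}')


def parse_id(text):
    """Return 'male' or 'female' for a well-formed ID string, else None."""
    match = ID_RE.fullmatch(text)  # fullmatch plays the role of ^ ... $
    if match is None:
        return None
    return 'male' if match.group('sex') == '1' else 'female'


print(parse_id('S123456789'))  # male
print(parse_id('S223456789'))  # female
print(parse_id('S323456789'))  # None
print(parse_id('A12345678'))   # None
```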

Reference:

# https://ithelp.ithome.com.tw/articles/10196283

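3. Matching a Taiwan mobile phone number:

The previous article on regex basics used Taiwan landline numbers to illustrate what groups are for. Mobile numbers are matched in much the same way, except that here three formats must be accepted:

09-33123456
0933-123456
0933123456

Taiwan mobile numbers start with 09, followed by an optional '-' and then 8 digits. Alternatively, 09 is followed by the two-digit original carrier prefix, an optional '-', and then 6 digits. The regular expression is:

pattern=r'''(09[-]?\d{8}| # 09, an optional '-', then 8 digits
09\d{2}[-]?\d{6})''' # 09, 2 digits, an optional '-', then 6 digits

For example: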

>>> import re

>>> pattern=r'(09[-]?\d{8}|09\d{2}[-]?\d{6})'

>>> re.findall(pattern, '09-33123456') # match

['09-33123456']

>>> re.findall(pattern, '0933-123456') # match

['0933-123456']

>>> re.findall(pattern, '0933123456') # match

['0933123456']

>>> re.findall(pattern, '0833123456') # no match: does not start with 09

[]

>>> re.findall(pattern, '093312345') # no match: too few digits

[]

>>> re.findall(pattern, '0933-12345') # no match: too few digits

[]

So numbers written in any of the three formats above all match.
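Because the pattern has no anchors, it will also pick a ten-digit number out of a longer run of digits. The sketch below, with hypothetical names MOBILE_RE and normalize_mobile, uses fullmatch() to accept only a complete number and strips the optional hyphen:

```python
import re

# The same two alternatives as above, without the outer group.
MOBILE_RE = re.compile(r'09-?\d{8}|09\d{2}-?\d{6}')


def normalize_mobile(text):
    """Return the number as 10 plain digits, or None if text is not a full match."""
    match = MOBILE_RE.fullmatch(text)
    return match.group().replace('-', '') if match else None


print(normalize_mobile('09-33123456'))   # 0933123456
print(normalize_mobile('0933-123456'))   # 0933123456
print(normalize_mobile('09331234567'))   # None (one digit too many)
print(MOBILE_RE.findall('09331234567'))  # ['0933123456'] (unanchored search)
```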

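4. Matching an e-mail address:

An e-mail address is divided at the @ (read "at"): the part before it is the user name (the local part) and the part after it is the mail host's domain name. Under the RFC rules the local part is at most 64 characters and may contain dots, though not at the start and not consecutively, and the domain is at most 255 characters ... The full rules are very complicated, so a simplified regex is used here:

pattern=r'[\w.]+@[\w.]+'

Inside the character class, \w stands for a letter, digit, or underscore (for str patterns in Python 3 this also covers non-ASCII letters and digits unless re.ASCII is given), the dot needs no escaping, and + means one or more such characters. For example: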

>>> import re

>>> pattern=r'[\w.]+@[\w.]+'

>>> re.search(pattern, 'abc123@gmail.com')

<re.Match object; span=(0, 16), match='abc123@gmail.com'>

>>> re.search(pattern, 'abc.tw@gmail.com')

<re.Match object; span=(0, 16), match='abc.tw@gmail.com'>

>>> re.findall(pattern, 'To:abc.tw@gmail.com;abc123@gmail.com')

['abc.tw@gmail.com', 'abc123@gmail.com']
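One thing worth keeping in mind with this simplified pattern: for str patterns in Python 3, \w is Unicode-aware, so it accepts far more than English letters. A short sketch (the sample address is made up for illustration) comparing the default behavior with the re.ASCII flag:

```python
import re

pattern = r'[\w.]+@[\w.]+'
address = '小明@gmail.com'  # made-up address with a non-ASCII local part

unicode_match = re.search(pattern, address)          # default: \w is Unicode-aware
ascii_match = re.search(pattern, address, re.ASCII)  # \w limited to [a-zA-Z0-9_]

print(unicode_match.group())  # 小明@gmail.com
print(ascii_match)            # None
```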

In the book "Python 自動化的樂趣", the author uses the following regex to match e-mail addresses:

pattern=r'''[a-zA-Z0-9._%+-]+ # user name
@ # the @ separator
[a-zA-Z0-9.-]+ # domain name
(\.[a-zA-Z]{2,4})''' # .com, .org, ...

For example:

>>> import re

>>> pattern=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4})'

>>> re.match(pattern, 'abc.tw@gmail.com')

<re.Match object; span=(0, 16), match='abc.tw@gmail.com'>

>>> re.search(pattern, 'abc.tw@gmail.com')

<re.Match object; span=(0, 16), match='abc.tw@gmail.com'>

>>> re.search(pattern, 'abc.tw@yahoo.com.tw')

<re.Match object; span=(0, 19), match='abc.tw@yahoo.com.tw'>

>>> re.findall(pattern, 'abc123@gmail.com')

['.com']

>>> re.findall(pattern, 'To:abc.tw@gmail.com;abc123@gmail.com')

['.com', '.com']

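So re.match() and re.search() both match the whole e-mail address, but re.findall() returns only the trailing (\.[a-zA-Z]{2,4}) part. The reason is that when a pattern contains a capturing group, findall() returns the text of the group rather than the whole match. Writing the group as non-capturing, (?:\.[a-zA-Z]{2,4}), makes findall() return the full address.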
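A short sketch of two ways to get whole addresses back, using the same sample text as above (the variable names are illustrative): either make the group non-capturing, or keep the group and read group(0) from re.finditer().

```python
import re

text = 'To:abc.tw@gmail.com;abc123@gmail.com'

capturing = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4})'
non_capturing = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+(?:\.[a-zA-Z]{2,4})'

# With one capturing group, findall() returns only that group.
print(re.findall(capturing, text))      # ['.com', '.com']

# With a non-capturing group, findall() returns the whole match.
whole = re.findall(non_capturing, text)
print(whole)                            # ['abc.tw@gmail.com', 'abc123@gmail.com']

# finditer() yields Match objects, so group(0) is available even with groups.
via_iter = [m.group(0) for m in re.finditer(capturing, text)]
print(via_iter)                         # ['abc.tw@gmail.com', 'abc123@gmail.com']
```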

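Appendix B of "處理大數據的必備美工刀" presents an e-mail regex that follows the RFC rules:

pattern=r'''(?!.)(?![\w.]*?\.\.)[\w.]{1,64}@ # lookahead (?!.) is meant to forbid a leading dot; the next one forbids consecutive dots
(?=[-a-zA-Z0-9.]{0,255}(?![-a-zA-Z0-9.])) # total length at most 255 characters
((?!-)[-a-zA-Z0-9]{1,63}\.)* # repeated 'label.' structure
((?!-)[-a-zA-Z0-9]){1,63}''' # the mandatory final label

This uses the more advanced lookaround assertions, and it really is complicated. When I actually tested it, though, it failed to match, even though I checked several times that I had typed it correctly. Strange: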

>>> import re

>>> pattern=r'(?!.)(?![\w.]*?\.\.)[\w.]{1,64}@(?=[-a-zA-Z0-9.]{0,255}(?![-a-zA-Z0-9.]))((?!-)[-a-zA-Z0-9]{1,63}\.)*((?!-)[-a-zA-Z0-9]){1,63}'

>>> re.search(pattern, 'abc123@gmail.com')

>>> re.search(pattern, 'abc.tw@g-mail.com.tw')

>>> re.findall(pattern, 'To:abc.tw@gmail.com;abc123@gmail.com')

[]

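The culprit is the first lookahead. An unescaped . means "any character", so (?!.) succeeds only at the end of the string or just before a newline, and the [\w.]{1,64} that follows can then never match. To forbid a leading dot as intended, the dot has to be escaped: (?!\.).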
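To check the effect of escaping the dot in the first lookahead, here is a sketch comparing the pattern as typed above with a variant that differs only in using (?!\.). The variant is my own adjustment and the names as_typed and escaped are illustrative. I have not verified it against the book's printed text or against the full RFC grammar.

```python
import re

# Pattern exactly as typed above, with an unescaped dot in the first lookahead.
as_typed = (r'(?!.)(?![\w.]*?\.\.)[\w.]{1,64}@'
            r'(?=[-a-zA-Z0-9.]{0,255}(?![-a-zA-Z0-9.]))'
            r'((?!-)[-a-zA-Z0-9]{1,63}\.)*((?!-)[-a-zA-Z0-9]){1,63}')

# Assumed fix: escape the dot so the lookahead only rejects a leading '.'.
escaped = (r'(?!\.)(?![\w.]*?\.\.)[\w.]{1,64}@'
           r'(?=[-a-zA-Z0-9.]{0,255}(?![-a-zA-Z0-9.]))'
           r'((?!-)[-a-zA-Z0-9]{1,63}\.)*((?!-)[-a-zA-Z0-9]){1,63}')

print(re.search(as_typed, 'abc123@gmail.com'))                # None
print(re.search(escaped, 'abc123@gmail.com').group())         # abc123@gmail.com
print(re.search(escaped, 'abc.tw@g-mail.com.tw').group())     # abc.tw@g-mail.com.tw

# The pattern has capturing groups, so use finditer() to get whole matches.
text = 'To:abc.tw@gmail.com;abc123@gmail.com'
found = [m.group(0) for m in re.finditer(escaped, text)]
print(found)                                                  # ['abc.tw@gmail.com', 'abc123@gmail.com']

# fullmatch() enforces the leading-dot and double-dot rules on a whole string.
print(re.fullmatch(escaped, '.abc@gmail.com'))                # None
print(re.fullmatch(escaped, 'a..b@gmail.com'))                # None
```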

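5. Matching a URL:

The format of a URL is:

[protocol]://[host]:[port]/[path][filename]?[query]#[ID]

The protocol can be http, https, ftp, and so on, so a fairly simple URL regex can be written like this:

pattern=r'''(https?|ftp):// # protocol; ? makes the preceding s optional
[\w.]+''' # letters, digits, underscores, and dots

For example: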

>>> import re

>>> pattern=r'(https?|ftp)://[\w.]+'

>>> re.search(pattern, 'http://google.com.tw')

<re.Match object; span=(0, 20), match='http://google.com.tw'>

>>> re.search(pattern, 'https://google.com.tw')

<re.Match object; span=(0, 21), match='https://google.com.tw'>

>>> re.search(pattern, 'ftp://abc.com.tw')

<re.Match object; span=(0, 16), match='ftp://abc.com.tw'>

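But this simple regex only matches as far as the host. The port, file name, and query parameters after it are not covered.

Appendix B of "處理大數據的必備美工刀" presents a more detailed URL regex (though it is still not strictly RFC-conformant):

pattern=r"""(https?|ftp):// # protocol; ? makes the preceding s optional
[^?:]+ # host name
(:[0-9]{1,5})? # port (optional)
(/?|(/[^/]+)* # file path and file name
(\?[^\s"']+)?)""" # query parameters (optional); the ? must be escaped

Port numbers actually only go up to 65535, but for simplicity the third line uses [0-9]{1,5}, so ports up to 99999 also match.

For example: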

>>> import re

>>> pattern=r"""(https?|ftp)://[^?:]+(:[0-9]{1,5})?(/?|(/[^/]+)*(\?[^\s"']+)?)"""

>>> re.search(pattern, 'https://google.com.tw')

<re.Match object; span=(0, 21), match='https://google.com.tw'>

>>> re.search(pattern, 'http://google.com.tw')

<re.Match object; span=(0, 20), match='http://google.com.tw'>

>>> re.search(pattern, 'ftp://abc.com.tw')

<re.Match object; span=(0, 16), match='ftp://abc.com.tw'>

>>> re.search(pattern, 'https://google.com.tw:80')

<re.Match object; span=(0, 24), match='https://google.com.tw:80'>

>>> re.search(pattern, 'https://abc.com.tw:5000/get_stocks?id=2330')

<re.Match object; span=(0, 24), match='https://abc.com.tw:5000/'>

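This regex matches through the port number without trouble, but the file name and query parameters after it are not captured. After studying it closely I found the reason: the book's regex puts /? before the pipe |, and since alternatives are tried from left to right, /? matches first and the match stops there. Here is the result with /? moved after the pipe: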

>>> import re

>>> pattern=r"""(https?|ftp)://[^?:]+(:[0-9]{1,5})?((/[^/]+)*(\?[^\s"']+)?|/?)"""

>>> re.search(pattern, 'https://abc.com.tw:5000/stock/get_stocks?id=2330')

<re.Match object; span=(0, 48), match='https://abc.com.tw:5000/stock/get_stocks?id=2330'>

>>> re.search(pattern, 'https://abc.com.tw:5000/get_stocks?id=2330')

<re.Match object; span=(0, 42), match='https://abc.com.tw:5000/get_stocks?id=2330'>

>>> re.search(pattern, 'https://abc.com.tw:5000/?id=2330')

<re.Match object; span=(0, 32), match='https://abc.com.tw:5000/?id=2330'>

>>> re.search(pattern, 'https://abc.com.tw:5000/')

<re.Match object; span=(0, 23), match='https://abc.com.tw:5000'>

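Now the file path and parameters are both matched, but in the last example the trailing '/' is lost. The reason is that (/[^/]+)* and (\?[^\s"']+)? can both match nothing, so the first alternative succeeds with an empty match right after the port and the /? alternative is never tried.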
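A sketch of why the trailing '/' goes missing, plus one possible adjustment. The adjusted pattern is my own assumption rather than the book's: it lets a path segment be empty, so that a lone '/' is consumed by the path part and the /? alternative is no longer needed. The names first_alternative and URL_RE are illustrative.

```python
import re

# First alternative of the reordered tail: on a lone '/', it matches the empty string.
first_alternative = re.compile(r'''(/[^/]+)*(\?[^\s"']+)?''')
print(repr(first_alternative.match('/').group()))  # ''

# Assumed adjustment: (/[^/]*)* accepts an empty segment, so '/' alone is matched.
URL_RE = re.compile(r"""(https?|ftp)://[^?:]+(:[0-9]{1,5})?(/[^/]*)*(\?[^\s"']+)?""")

urls = [
    'https://abc.com.tw:5000/stock/get_stocks?id=2330',
    'https://abc.com.tw:5000/get_stocks?id=2330',
    'https://abc.com.tw:5000/?id=2330',
    'https://abc.com.tw:5000/',
    'https://abc.com.tw:5000',
]
results = [URL_RE.search(url).group() for url in urls]
for matched in results:
    print(matched)  # each URL is matched in full, including the trailing '/'
```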

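In addition, the URL regex used in the HTTP example in chapter 5 of "精通正規表達式" is comparatively simple and easy to follow. I slightly modified its protocol part so that it covers the three protocols http, https, and ftp:

pattern=r'''(https?|ftp):// # three protocols
([^/:]+) # host
(:(\d+))? # port (optional)
(/.*)?''' # path, file name, query parameters, etc.

For example: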

>>> import re

>>> pattern=r'(https?|ftp)://([^/:]+)(:(\d+))?(/.*)?'

>>> re.search(pattern, 'https://abc.com.tw:5000/stock/get_stocks?id=2330')

<re.Match object; span=(0, 48), match='https://abc.com.tw:5000/stock/get_stocks?id=2330'>

>>> re.search(pattern, 'https://abc.com.tw:5000/get_stocks?id=2330')

<re.Match object; span=(0, 42), match='https://abc.com.tw:5000/get_stocks?id=2330'>

>>> re.search(pattern, 'https://abc.com.tw:5000/?id=2330')

<re.Match object; span=(0, 32), match='https://abc.com.tw:5000/?id=2330'>

>>> re.search(pattern, 'https://abc.com.tw:5000/')

<re.Match object; span=(0, 24), match='https://abc.com.tw:5000/'>

So this regex is both simple and practical.
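Since every part of this pattern sits in its own group, it can also be used to split a URL into pieces. A minimal sketch, with split_url and the dictionary keys being names I made up for illustration:

```python
import re

URL_RE = re.compile(r'(https?|ftp)://([^/:]+)(:(\d+))?(/.*)?')


def split_url(url):
    """Return the protocol, host, port, and path of url, or None if no match."""
    match = URL_RE.search(url)
    if match is None:
        return None
    return {
        'protocol': match.group(1),
        'host': match.group(2),
        'port': match.group(4),  # group 3 includes the colon; group 4 is digits only
        'path': match.group(5),  # None when the URL has no path
    }


print(split_url('https://abc.com.tw:5000/stock/get_stocks?id=2330'))
print(split_url('http://google.com.tw'))
```

Optional groups that do not take part in the match come back as None, so a URL without a port or path yields None for those keys.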

6. Matching hyperlinks in a web page (href):

Hyperlinks in a web page are placed in the href attribute of the a tag, for example:

<a href="http://www.google.com.tw">Google</a>

<a href='http://www.google.com.tw'>Google</a>

<a href=http://www.google.com.tw>Google</a>

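So the href value may be enclosed in single or double quotes, or left unquoted. A regex for extracting link URLs from HTML source is:

pattern=r'''href=
[\'"]? # optional quote (the single quote is escaped here, though a triple-quoted string does not require it)
([^\'" >]+) # URL (one or more characters that are not a quote, a space, or >)
[\'"]?''' # optional quote

The URL part is wrapped in parentheses as a group, so group(1) returns the URL itself without href. I adapted this from the following forum discussion:

# Regular expression to extract URL from an HTML link

A note on escaping. In a long string delimited by triple quotes, a lone single or double quote does not need to be escaped (escaping it is harmless). Inside a character class only a few characters are special: '\', ']' (unless it comes first), '^' (when it comes first), and '-' (when it sits between two characters). Other metacharacters need no escaping there, though escaping them does no harm. Quotes are a matter of the Python string literal rather than the regex: in a one-line pattern, escape whichever quote delimits the string (the single quote if the string uses single quotes, the double quote if it uses double quotes), otherwise the string is cut short and an error is likely. For example: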

>>> import re

>>> html='''
<a href="http://www.google.com" target="_blank">Google</a>
<a href='http://tw.yahoo.com' >Yahoo Taiwan</a>
<a href=http://twitter.com >Yahoo Taiwan</a>'''

>>> match=re.search(r'href=[\'"]?([^\'" >]+)[\'"]?', html)

>>> match

<re.Match object; span=(4, 32), match='href="http://www.google.com"'>

>>> match.group()

'href="http://www.google.com"'

>>> match.group(1)

'http://www.google.com'

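So calling group() or group(0) on the Match object returns the entire matched string, while group(1) returns the text matched by the first group, i.e. the URL. re.findall(), on the other hand, returns the group matches collected in a list. For example: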

>>> re.findall(r'href=[\'"]?([^\'" >]+)[\'"]?', html)

['http://www.google.com', 'http://tw.yahoo.com', 'http://twitter.com']
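To make the quoting point concrete, here is a sketch that writes the same regex twice: once as a one-line raw string delimited by single quotes (so the single quotes inside must be escaped), and once as a triple-quoted raw string (no escaping needed). The variable names are illustrative.

```python
import re

html = "<a href='http://tw.yahoo.com' >Yahoo Taiwan</a>"

# One-line raw string in single quotes: each ' inside must be written as \'.
one_line = r'href=[\'"]?([^\'" >]+)[\'"]?'

# Triple-quoted raw string: lone quotes can appear unescaped.
triple = r'''href=['"]?([^'" >]+)['"]?'''

print(re.findall(one_line, html))                # ['http://tw.yahoo.com']
print(re.findall(triple, html))                  # ['http://tw.yahoo.com']

# In a raw string the backslashes stay in the pattern; the regex engine
# reads \' as a literal quote, so both patterns behave the same.
print(one_line.count('\\'), triple.count('\\'))  # 3 0
```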

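But this simple regex does not allow for extra spaces after the a tag or spaces around the = of href, so on non-standard pages it may fail to capture the URLs of all the links. The same weakness shows up in the following variant, which starts matching at <a and captures with (.*):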

>>> html='''
<a href="http://www.google.com" target="_blank">Google</a>
<a href = 'http://tw.yahoo.com' >Yahoo Taiwan</a>
<a href= http://twitter.com >Yahoo Taiwan</a>'''

>>> re.findall(r'<a[^>]+href=["\']?(.*)["\']?', html)

['http://www.google.com" target="_blank">Google</a>', ' http://twitter.com >Yahoo Taiwan</a>']

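The HTML here differs from the earlier sample: spaces were deliberately left around the href equals sign in the second and third a tags. The second link (with a space before the =) is missed entirely, and for the other two the greedy (.*) runs to the end of the line instead of stopping at the URL, so this regex cannot extract all the URLs.

Appendix B, page B-19 of "處理大數據的必備美工刀" presents a more careful hyperlink regex that does handle the whitespace around href. The original is somewhat complicated, so I first simplified it as follows:

pattern=r'''<a\s+ # link tag (followed by one or more whitespace characters)
href\s*=\s* # href attribute (zero or more whitespace characters on either side of the equals sign)
["\']?([^"\'\s]+)["\']?''' # URL captured as a group (optionally quoted)

This regex uses \s for a space (in fact \s also covers Tab, newline, and other whitespace characters), and the URL is captured by the group ([^"\'\s]+). For example: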

>>> html='''
<a href="http://www.google.com" target="_blank">Google</a>
<a href = 'http://tw.yahoo.com' >Yahoo Taiwan</a>
<a href= http://twitter.com >Yahoo Taiwan</a>'''

>>> re.search(r'<a\s+href\s*=\s*["\']?([^"\'\s]+)["\']?', html)

<re.Match object; span=(1, 32), match='<a href="http://www.google.com"'>

>>> re.findall(r'<a\s+href\s*=\s*["\']?([^"\'\s]+)["\']?', html)

['http://www.google.com', 'http://tw.yahoo.com', 'http://twitter.com']

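Yes, this regex works better.

7. Matching image URLs in a web page (src):

Images in a web page are rendered with the img tag (which has no closing tag), and the image URL is placed in the src attribute.

The regex for extracting image URLs is similar to the one for links above: just change the a tag to img and the href attribute to src:

pattern=r'''<img\s+ # image tag (followed by one or more whitespace characters)
src\s*=\s* # src attribute (zero or more whitespace characters on either side of the equals sign)
["\']?([^"\'\s]+)["\']?''' # URL captured as a group (optionally quoted)

For example: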

>>> import re

>>> html='''
<img src="http://abc.com.tw/cat.jpg">
<img src = /images/dog.jpg width=300 height=200>
<img src = 'bird.jpg' alt='bird'>'''

>>> re.findall(r'<img\s+src\s*=\s*["\']?([^"\'\s]+)["\']?', html)

['http://abc.com.tw/cat.jpg', '/images/dog.jpg', 'bird.jpg']

But on some pages src does not come right after img. Other attributes may sit in between, and then the simplified regex above breaks down. For example:

>>> html='''
<img src="http://abc.com.tw/cat.jpg">
<img src = /images/dog.jpg width=300 height=200>
<img border='1' src = 'bird.jpg'>
<img alt=deer src = 'deer.jpg'>'''

>>> re.findall(r'<img\s+src\s*=\s*["\']?([^"\'\s]+)["\']?', html)

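['http://abc.com.tw/cat.jpg', '/images/dog.jpg'] # the URLs of the last two images are missed

In this example the src of the third and fourth images is preceded by a border attribute and an alt attribute respectively, so those two images fail to match. The fix is to allow for whatever may appear before src in the regex. The idea is simple: what precedes src is the value of some attribute, which may be wrapped in single quotes, in double quotes, or in no quotes at all (like deer in this example), so all three cases have to be allowed. The revised regex is:

pattern=r'''<img # image tag
[^>]* # any characters other than '>' (attributes that precede src)
\s+src\s*=\s* # src attribute (zero or more whitespace characters on either side of the equals sign)
["\']?([^"\'\s]+)["\']?''' # URL captured as a group (optionally quoted)

Here I added [^>]*, used earlier, after img to stand for any attribute settings that come before src. Note that in a long (triple-quoted) string the quotes inside a character class need no escaping (though escaping does no harm), whereas in a one-line string it depends on which quote delimits the whole pattern: escape that one, or the string ends early and causes an error. For example: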

>>> html='''
<img src="http://abc.com.tw/cat.jpg">
<img src = /images/dog.jpg width=300 height=200>
<img border='1' src = 'bird.jpg'>
<img alt=deer src = 'deer.jpg'>'''

>>> re.findall(r'<img[^>]*\s+src\s*=\s*["\']?([^"\'\s]+)["\']?', html)

['http://abc.com.tw/cat.jpg', '/images/dog.jpg', 'bird.jpg', 'deer.jpg']

All four image URLs are now captured.

So the hyperlink regex above can likewise use [^>]* to cope with other attributes placed before href:

>>> html='''
<a target="_blank" href="http://www.google.com">Google</a>
<a target="_blank" href = 'http://tw.yahoo.com' >Yahoo Taiwan</a>
<a href= http://twitter.com >Yahoo Taiwan</a>'''

>>> re.findall(r'<a[^>]*\s+href\s*=\s*["\']?([^"\'\s]+)["\']?', html)

['http://www.google.com', 'http://tw.yahoo.com', 'http://twitter.com']

Here target="_blank" is deliberately placed before href, and all three URLs are still matched correctly.
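Since the href and src patterns differ only in the tag and attribute names, they can be generated by one helper. The sketch below is my own generalization (extract_attr is a hypothetical name). It also adds '>' to the excluded characters of the URL group, an assumption on my part, so that an unquoted value followed directly by '>' (as in <a href=http://www.google.com.tw>) is not captured together with the rest of the tag.

```python
import re


def extract_attr(html, tag, attr):
    """Return the values of attr found in <tag ...> elements of html."""
    pattern = (
        '<' + re.escape(tag)      # opening of the tag, e.g. <a or <img
        + r'[^>]*\s+'             # any earlier attributes, then whitespace
        + re.escape(attr)         # attribute name, e.g. href or src
        + r'\s*=\s*'              # equals sign with optional whitespace
        + r'["\']?([^"\'\s>]+)["\']?'  # value, optionally quoted; stops at '>'
    )
    return re.findall(pattern, html)


links = '''
<a target="_blank" href="http://www.google.com">Google</a>
<a target="_blank" href = 'http://tw.yahoo.com' >Yahoo Taiwan</a>
<a href=http://www.google.com.tw>Google</a>'''

images = '''
<img src="http://abc.com.tw/cat.jpg">
<img border='1' src = 'bird.jpg'>
<img alt=deer src = 'deer.jpg'>'''

print(extract_attr(links, 'a', 'href'))
print(extract_attr(images, 'img', 'src'))
```

This remains a pattern-matching shortcut for simple pages, not an HTML parser. For example, the tag name is matched as a prefix, so 'a' would also match tags whose names merely begin with a.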

The commonly used regular expressions tested above are summarized in the following table:

Common matching task	Regular expression
Integer from 0 to 255	([0-9]|[0-9]{2}|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
Taiwan national ID number	[A-Z][12][0-9]{8}
Taiwan mobile phone number	(09[-]?\d{8}|09\d{2}[-]?\d{6})
E-mail address	[\w.]+@[\w.]+
URL	(https?|ftp)://([^/:]+)(:(\d+))?(/.*)?
Hyperlinks in a web page	<a[^>]*\s+href\s*=\s*["\']?([^"\'\s]+)["\']?
Image URLs in a web page	<img[^>]*\s+src\s*=\s*["\']?([^"\'\s]+)["\']?