小狐狸事務所: Python Study Notes: Web Scraping in Practice (16): Paginated Book Data from books.toscrape.com

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]

Next, use a selector in the same way to get the price information from the p elements:

>>> prices=soup.select('article > div.product_price > p.price_color')

>>> len(prices)

20

>>> prices[0].text

'Â£51.77'

>>> prices[1].text

'Â£53.74'

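The pound sign comes out garbled ('Â£'), most likely because the page's UTF-8 bytes were decoded with a different encoding. Since only the number matters here, a regular expression can pull out the numeric part of the price: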

>>> import re

>>> re.findall(r'\d+\.\d+', prices[0].text)

['51.77']

>>> re.findall(r'\d+\.\d+', prices[0].text)[0]

'51.77'

>>> float(re.findall(r'\d+\.\d+', prices[0].text)[0])

51.77
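The 'Â£' seen above is what the UTF-8 bytes of '£' look like when they are decoded as ISO-8859-1 (Latin-1). The following is a minimal offline sketch of that effect. It assumes, without having checked the live site, that the response text was decoded as Latin-1 instead of UTF-8; the sample string is made up for illustration.

```python
# Reproduce the garbled pound sign without any network access.
original = '£51.77'

# Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
garbled = original.encode('utf-8').decode('iso-8859-1')
print(garbled)      # Â£51.77

# Reversing the two steps recovers the original text.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)     # £51.77
```

If that assumption holds, setting res.encoding='utf-8' on the requests response before reading res.text would be another way to avoid the garbled character. The regular expression approach works either way because it only looks at the digits.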

As before, a list comprehension collects the prices from the list of p element Tag objects into a list:

>>> book_prices=[float(re.findall(r'\d+\.\d+', price.text)[0]) for price in prices]

>>> book_prices

[51.77, 53.74, 50.1, 47.82, 54.23, 22.65, 33.34, 17.93, 22.6, 52.15, 13.99, 20.66, 17.46, 52.29, 35.02, 57.25, 23.88, 37.59, 51.33, 45.17]
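The comprehension above indexes [0] on the result of re.findall(), which raises IndexError if a price string contains no decimal number. As a hedged alternative, the extraction can be wrapped in a small helper that returns None in that case. The name parse_price and the sample strings are hypothetical and do not come from the original notes.

```python
import re

PRICE_RE = re.compile(r'\d+\.\d+')

def parse_price(text):
    """Return the first decimal number in text as a float, or None if absent."""
    match = PRICE_RE.search(text)
    return float(match.group()) if match else None

# Sample inputs: a garbled price, a clean price, and a string with no number.
samples = ['Â£51.77', '£53.74', 'n/a']
parsed = [parse_price(s) for s in samples]
print(parsed)   # [51.77, 53.74, None]
```

For a site as regular as this one the helper is optional; it mainly matters when some entries may have no price at all.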

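We now have the list of a element Tag objects, links (the source of the book titles), and the list of p element Tag objects, prices (the source of the book prices), along with the book_names and book_prices lists extracted from them. The zip() function can pair the two extracted lists into a zip object so that a loop can walk through (title, price) pairs. For zip() usage, see: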

# Python Study Notes: Using the built-in function zip()

>>> zipped=zip(book_names, book_prices) # pair the title list with the price list

>>> type(zipped)

<class 'zip'>

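Next, iterate over the zip object with a list comprehension, taking the title and price from each pair to build data, a list of tuples. Each title is now bound to its price in a tuple, which makes the data easy to write to a CSV file: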

>>> data=[(z[0], z[1]) for z in zipped] # turn the zipped titles and prices into a list of tuples

>>> data[0]

('A Light in the Attic', 51.77)

>>> data[1]

('Tipping the Velvet', 53.74)
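The pairing step can be tried on its own with two short lists. The sketch below uses the first two titles and prices from this page as sample data. It also shows a detail worth knowing: a zip object is an iterator, so it can be consumed only once.

```python
book_names = ['A Light in the Attic', 'Tipping the Velvet']
book_prices = [51.77, 53.74]

zipped = zip(book_names, book_prices)
data = list(zipped)     # same result as [(z[0], z[1]) for z in zipped]
print(data)
# [('A Light in the Attic', 51.77), ('Tipping the Velvet', 53.74)]

# The zip object is now exhausted, so a second pass yields nothing.
leftover = list(zipped)
print(leftover)         # []
```

Calling list() on the zip object is a shorter equivalent of the comprehension; either form produces the list of tuples that csv.writer expects.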

Then import the csv module and write this list of tuples straight to a CSV file:

>>> import csv

>>> with open('books_page_1.csv', 'w', newline='', encoding='utf-8') as f:
        writer=csv.writer(f)
        writer.writerows(data)

Open books_page_1.csv in Excel:

The titles and prices from the first page have been written to the CSV file successfully.
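Besides opening the file in Excel, the result can be checked by reading it back with the csv module. The sketch below is a small round trip on sample data. The header row, the file name books_sample.csv, and the use of a temporary directory are assumptions made for illustration and are not part of the original notes.

```python
import csv
import os
import tempfile

data = [('A Light in the Attic', 51.77), ('Tipping the Velvet', 53.74)]

# Write to a temporary directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), 'books_sample.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price'])   # hypothetical header row
    writer.writerows(data)

# Read the file back; the csv module returns every field as a string.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```

Note that the prices come back as strings such as '51.77', so they need float() again if they are to be used as numbers.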

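All that remains is to loop over the pages, scrape the titles and prices from each one, and write them to the CSV file. The site is known to have 50 pages, but it could add books within its existing structure, which would increase the page count. To make the scraper more robust, first read the page count shown at the bottom of the home page (which is really just page 1) to decide how many times the loop should run. That page indicator sits inside the li element with class="current":

Pressing Ctrl+F in the Elements tab and searching for "current" shows only one match, so select_one() can be used to find it: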

>>> url='https://books.toscrape.com/'

>>> res=requests.get(url)

>>> soup=BeautifulSoup(res.text, 'lxml')

>>> page_li=soup.select_one('.current')

>>> page_li

<li class="current">

Page 1 of 50

</li>

>>> page_li.text

'\n \n Page 1 of 50\n \n '

>>> page_li.text.strip()

'Page 1 of 50'

>>> pages=int(page_li.text.strip().split('of')[1])

>>> pages

50
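Splitting on 'of' works for this label, but it would break if the text before the number ever contained the letters "of" elsewhere. As a hedged alternative, a regular expression can pick out the number that follows "of", and the page URLs can then be built from the count. The function name parse_page_count is hypothetical; the URL pattern is the one used by the full script in these notes.

```python
import re

def parse_page_count(label):
    """Extract the total page count from a label such as 'Page 1 of 50'."""
    match = re.search(r'of\s+(\d+)', label)
    if match is None:
        raise ValueError(f'no page count found in {label!r}')
    return int(match.group(1))

label = '\n \n Page 1 of 50\n \n '
pages = parse_page_count(label)
print(pages)    # 50

# Build one URL per page, numbered from 1 to pages.
urls = [f'https://books.toscrape.com/catalogue/page-{n}.html'
        for n in range(1, pages + 1)]
print(urls[0])
print(urls[-1])
```

Raising ValueError when the label is missing makes a layout change on the site fail loudly instead of silently looping zero times.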

The complete code for the tests above is as follows:

# books_toscrape_1.py
import requests
from bs4 import BeautifulSoup
import re
import csv
import time

# Read the total page count from the "Page 1 of 50" label on the home page
url='https://books.toscrape.com/'
res=requests.get(url)
soup=BeautifulSoup(res.text, 'lxml')
page_li=soup.select_one('.current')
pages=int(page_li.text.strip().split('of')[1])

csv_file='books.toscrape.com.csv'
reg=r'\d+\.\d+'    # numeric part of a price string
with open(csv_file, 'w', newline='', encoding='utf-8') as f:
    writer=csv.writer(f)
    for i in range(pages):
        print(f'Fetching page {i + 1} ... ', end='')
        url=f'https://books.toscrape.com/catalogue/page-{i+1}.html'
        res=requests.get(url)
        soup=BeautifulSoup(res.text, 'lxml')
        links=soup.select('article > h3 > a')
        prices=soup.select('article > div.product_price > p.price_color')
        book_names=[link.get('title') for link in links]
        book_prices=[float(re.findall(reg, price.text)[0]) for price in prices]
        zipped=zip(book_names, book_prices)
        data=[(z[0], z[1]) for z in zipped]
        writer.writerows(data)    # append this page's rows to the file
        print('Saved!')
        time.sleep(0.5)    # pause briefly between requests

The run produces the following output:

>>> %Run books_toscrape_1.py
Fetching page 1 ... Saved!
Fetching page 2 ... Saved!
Fetching page 3 ... Saved!
Fetching page 4 ... Saved!
Fetching page 5 ... Saved!
Fetching page 6 ... Saved!
Fetching page 7 ... Saved!
Fetching page 8 ... Saved!
Fetching page 9 ... Saved!
Fetching page 10 ... Saved!
Fetching page 11 ... Saved!
Fetching page 12 ... Saved!
Fetching page 13 ... Saved!
Fetching page 14 ... Saved!
Fetching page 15 ... Saved!
Fetching page 16 ... Saved!
Fetching page 17 ... Saved!
Fetching page 18 ... Saved!
Fetching page 19 ... Saved!
Fetching page 20 ... Saved!
Fetching page 21 ... Saved!
Fetching page 22 ... Saved!
Fetching page 23 ... Saved!
Fetching page 24 ... Saved!
Fetching page 25 ... Saved!
Fetching page 26 ... Saved!
Fetching page 27 ... Saved!
Fetching page 28 ... Saved!
Fetching page 29 ... Saved!
Fetching page 30 ... Saved!
Fetching page 31 ... Saved!
Fetching page 32 ... Saved!
Fetching page 33 ... Saved!
Fetching page 34 ... Saved!
Fetching page 35 ... Saved!
Fetching page 36 ... Saved!
Fetching page 37 ... Saved!
Fetching page 38 ... Saved!
Fetching page 39 ... Saved!
Fetching page 40 ... Saved!
Fetching page 41 ... Saved!
Fetching page 42 ... Saved!
Fetching page 43 ... Saved!
Fetching page 44 ... Saved!
Fetching page 45 ... Saved!
Fetching page 46 ... Saved!
Fetching page 47 ... Saved!
Fetching page 48 ... Saved!
Fetching page 49 ... Saved!
Fetching page 50 ... Saved!

Open books.toscrape.com.csv:

All 50 pages were scraped, giving the titles and prices of 1000 books in total. Mission accomplished!
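The row count can also be confirmed in code instead of by eye. The following is a minimal sketch, assuming a headerless CSV with a title column followed by a price column, as written by the script above. The function name summarize and the two-row sample file are hypothetical; pointing the function at books.toscrape.com.csv after a full run should report the total number of rows.

```python
import csv
import os
import tempfile

def summarize(csv_path):
    """Return (row_count, lowest_price, highest_price) for a title,price CSV."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows = [(title, float(price)) for title, price in csv.reader(f)]
    prices = [price for _, price in rows]
    return len(rows), min(prices), max(prices)

# Build a tiny sample file so the sketch runs without the scraped data.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([('A Light in the Attic', 51.77),
                             ('Tipping the Velvet', 53.74)])

result = summarize(sample_path)
print(result)   # (2, 51.77, 53.74)
```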


Sunday, June 30, 2024
