A Hands-On Review of the httpx Request Library and the parsel Parsing Library for Python Crawlers

Date: 2022-06-20 10:50:23

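Two of the newest and most popular tools in the Python web-crawling world are httpx and parsel. httpx bills itself as a next-generation HTTP client: its API is broadly compatible with requests, and it can also send asynchronous requests, which makes writing asynchronous crawlers much easier. parsel was originally part of Scrapy, the well-known Python crawling framework, and was later split out into a standalone package. It supports several extraction methods, including XPath selectors, CSS selectors, and regular expressions, and is reportedly faster at parsing than BeautifulSoup.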

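Today we will put both libraries to the test by scraping second-hand home listings from Lianjia (鏈家網). To save time, the example is limited to three-bedroom homes in Shanghai's Pudong New Area priced between 5 and 8 million yuan.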

requests + BeautifulSoup

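First up is requests + BeautifulSoup, the combination most people start with when learning to write crawlers in Python. The crawler's entry URL in this example is https://sh.lianjia.com/ershoufang/pudong/a3p5/. It first sends a request to find the maximum page number, then requests each page in a loop and extracts the fields we want (community name, floor, orientation, total price, unit price, and so on), and finally exports everything to a CSV file. If you are reading this article you presumably already know the basics of Python crawling, so we will not explain every line of code.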
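The first step, finding the maximum page number, relies on the pager element's page-data attribute, which holds a JSON string. Here is a minimal sketch of that step and of building the per-page URLs. The sample attribute value is illustrative rather than captured from the site, the helper name `build_page_urls` is hypothetical, and `json.loads` is used as a safer alternative to calling `eval` on scraped text.

```python
import json

# URL pattern for result pages, as used in the crawler
PAGE_URL = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'


def build_page_urls(page_data):
    """Return one URL per result page, given the pager's page-data string."""
    total_pages = json.loads(page_data)['totalPage']
    return [PAGE_URL.format(i) for i in range(1, total_pages + 1)]


# Illustrative attribute value (assumed shape, not real site output)
sample_page_data = '{"totalPage": 3, "curPage": 1}'
urls = build_page_urls(sample_page_data)
for url in urls:
    print(url)
```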

The complete code is shown below:

# homelink_requests.py
# Author: 大江狗
import csv
import json
import re
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {'User-Agent': self.ua.random}
        self.data = list()
        self.path = '浦東_三房_500_800萬.csv'
        self.url = 'https://sh.lianjia.com/ershoufang/pudong/a3p5/'

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            a = soup.select('div[class="page-box house-lst-page-box"]')
            # page-data holds a JSON string; json.loads turns it into a dict
            # (safer than calling eval on scraped text)
            max_page = json.loads(a[0].attrs['page-data'])['totalPage']
            return max_page
        else:
            print('Request failed, status: {}'.format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            return
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            ul = soup.find_all('ul', class_='sellListContent')
            li_list = ul[0].select('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.select('div[class="title"]')[0].get_text()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.select('div[class="houseInfo"]')[0].get_text()
                house_info_list = house_info.split(' | ')
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # re.search matches anywhere in the string
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = '未知'

                # Match the construction year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = '未知'

                # e.g. 文蘭小區 - 塘橋: extract the community name and the area
                position_info = li.select('div[class="positionInfo"]')[0].get_text().split(' - ')
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: match 650
                price_pattern = re.compile(r'\d+')
                total_price = li.select('div[class="totalPrice"]')[0].get_text()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: match 64182
                unit_price = li.select('div[class="unitPrice"]')[0].get_text()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ['標題', '小區', '房廳', '面積', '朝向', '樓層', '年份', '位置', '總價(萬)', '單價(元/平方米)']
        keys = ['title', 'house', 'bedroom', 'area', 'direction', 'floor', 'year', 'location', 'total_price', 'unit_price']
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    # print(row_data)
                    writer.writerow(row_data)
            print('Write a CSV file to path %s Successful.' % self.path)
        except Exception as e:
            print('Fail to write CSV to path: %s, Case: %s' % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print('Elapsed: {} seconds'.format(end - start))
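The field extraction in this script can be tried in isolation, without touching the network. The sketch below applies the same split-and-regex approach to the sample houseInfo string quoted in the code comments. The helper name `parse_house_info` is hypothetical, and the length checks are an addition for robustness rather than part of the crawler.

```python
import re

FLOOR_PATTERN = re.compile(r'\d{1,2}')  # floor count, e.g. the 6 in 高樓層(共6層)
YEAR_PATTERN = re.compile(r'\d{4}')     # construction year, e.g. 1999


def parse_house_info(house_info):
    """Split a houseInfo string on ' | ' and pull out the fields of interest."""
    parts = house_info.split(' | ')
    # Guard against shorter strings instead of raising IndexError
    floor_match = FLOOR_PATTERN.search(parts[4]) if len(parts) > 4 else None
    year_match = YEAR_PATTERN.search(parts[5]) if len(parts) > 5 else None
    return {
        'bedroom': parts[0],
        'area': parts[1],
        'direction': parts[2],
        'floor': floor_match.group() if floor_match else '未知',
        'year': year_match.group() if year_match else '未知',
    }


# Sample string taken from the comment in the crawler code
sample = '2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓'
info = parse_house_info(sample)
print(info)
```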

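Note: the script uses fake_useragent, requests, and BeautifulSoup (the pip packages fake-useragent, requests, and beautifulsoup4), all of which must be installed with pip beforehand.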

Now for the results: the run took about 18.5 seconds and scraped 580 records in total.
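The CSV export step can also be exercised on its own. The sketch below writes to an in-memory buffer instead of a file so it runs anywhere. The column headers and keys are the ones used by the crawler; the helper name `records_to_csv` and the sample records are hypothetical (the values are borrowed from comments in the script, not from a real listing), and the fallback to '未知' for a missing key is an assumption of this sketch.

```python
import csv
import io

HEAD = ['標題', '小區', '房廳', '面積', '朝向', '樓層', '年份', '位置', '總價(萬)', '單價(元/平方米)']
KEYS = ['title', 'house', 'bedroom', 'area', 'direction', 'floor', 'year',
        'location', 'total_price', 'unit_price']


def records_to_csv(records):
    """Serialize a list of listing dicts to CSV text, header row first."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, dialect='excel')
    writer.writerow(HEAD)
    for item in records:
        # Missing keys fall back to '未知' instead of raising KeyError
        writer.writerow([item.get(key, '未知') for key in KEYS])
    return buffer.getvalue()


# Made-up records for illustration only
complete = {'title': 'sample listing', 'house': '文蘭小區', 'bedroom': '2室1廳',
            'area': '74.14平米', 'direction': '南', 'floor': '6', 'year': '1999',
            'location': '塘橋', 'total_price': '650', 'unit_price': '64182'}
missing_year = {k: v for k, v in complete.items() if k != 'year'}

csv_text = records_to_csv([complete, missing_year])
print(csv_text)
```

When writing to a real file, opening it with encoding='utf_8_sig' as the crawler does adds a byte-order mark so that Excel detects the encoding and displays the Chinese headers correctly.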

requests + parsel

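This time we again use requests to fetch the pages, but parse them with parsel (install it with pip first). parsel is used much like BeautifulSoup: create an instance, then use selectors to extract DOM elements and data. The syntax differs slightly, though. BeautifulSoup has its own API, whereas parsel supports standard CSS selectors (plus the ::text and ::attr() extensions) and XPath selectors, and returns text or attribute values through the get() or getall() method, which is more convenient.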

# BeautifulSoup usage (response is a page fetched earlier with requests)
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
ul = soup.find_all('ul', class_='sellListContent')[0]

# parsel usage, via the Selector class
from parsel import Selector
selector = Selector(response.text)
ul = selector.css('ul.sellListContent')[0]

# Getting text or attribute values with parsel
selector.css('div.title span::text').get()
selector.css('ul li a::attr(href)').get()
for li in selector.css('ul > li'):
    print(li.xpath('.//@href').get())

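Note: older versions of parsel used extract_first() and extract() to retrieve text or attribute values. Newer versions provide get() and getall() as their respective replacements; the old names still work.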

The full code is shown below:

# homelink_parsel.py
# Author: 大江狗
import csv
import json
import re
import time

import requests
from fake_useragent import UserAgent
from parsel import Selector


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {'User-Agent': self.ua.random}
        self.data = list()
        self.path = '浦東_三房_500_800萬.csv'
        self.url = 'https://sh.lianjia.com/ershoufang/pudong/a3p5/'

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # Create a Selector instance
            selector = Selector(response.text)
            # Use a CSS selector to find the pager div that holds the max page number
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # page-data holds a JSON string; json.loads turns it into a dict
            # (safer than eval). './@page-data' reads the attribute of this div
            # only, whereas '//@page-data' would search the whole document.
            max_page = json.loads(a[0].xpath('./@page-data').get())['totalPage']
            print('Max page number: {}'.format(max_page))
            return max_page
        else:
            print('Request failed, status: {}'.format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            return
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(' | ')
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # re.search matches anywhere in the string
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = '未知'

                # Match the construction year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = '未知'

                # e.g. 文蘭小區 - 塘橋: extract the community name and the area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: match 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: match 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ['標題', '小區', '房廳', '面積', '朝向', '樓層', '年份', '位置', '總價(萬)', '單價(元/平方米)']
        keys = ['title', 'house', 'bedroom', 'area', 'direction', 'floor', 'year', 'location', 'total_price', 'unit_price']
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    # print(row_data)
                    writer.writerow(row_data)
            print('Write a CSV file to path %s Successful.' % self.path)
        except Exception as e:
            print('Fail to write CSV to path: %s, Case: %s' % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print('Elapsed: {} seconds'.format(end - start))

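Now for the results: scraping the same 580 records took about 16.5 seconds, 2 seconds less than before. This suggests that parsel parses more efficiently than BeautifulSoup; the difference is small for a short crawl, but may grow with larger jobs.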

httpx (synchronous) + parsel

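Now let's go one step further and replace requests with httpx. httpx sends synchronous requests in almost exactly the same way as requests, so we only need to swap the import and change the two request calls in the previous example from requests to httpx; the rest of the code is identical.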

import csv
import json
import re
import time

import httpx
from fake_useragent import UserAgent
from parsel import Selector


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {'User-Agent': self.ua.random}
        self.data = list()
        self.path = '浦東_三房_500_800萬.csv'
        self.url = 'https://sh.lianjia.com/ershoufang/pudong/a3p5/'

    def get_max_page(self):
        # Changed here: requests replaced with httpx
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # Create a Selector instance
            selector = Selector(response.text)
            # Use a CSS selector to find the pager div that holds the max page number
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # page-data holds a JSON string; json.loads turns it into a dict
            # (safer than eval). './@page-data' reads the attribute of this div only.
            max_page = json.loads(a[0].xpath('./@page-data').get())['totalPage']
            print('Max page number: {}'.format(max_page))
            return max_page
        else:
            print('Request failed, status: {}'.format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            return
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            # Changed here: requests replaced with httpx
            response = httpx.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(' | ')
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # re.search matches anywhere in the string
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = '未知'

                # Match the construction year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = '未知'

                # e.g. 文蘭小區 - 塘橋: extract the community name and the area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: match 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: match 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ['標題', '小區', '房廳', '面積', '朝向', '樓層', '年份', '位置', '總價(萬)', '單價(元/平方米)']
        keys = ['title', 'house', 'bedroom', 'area', 'direction', 'floor', 'year', 'location', 'total_price', 'unit_price']
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    # print(row_data)
                    writer.writerow(row_data)
            print('Write a CSV file to path %s Successful.' % self.path)
        except Exception as e:
            print('Fail to write CSV to path: %s, Case: %s' % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print('Elapsed: {} seconds'.format(end - start))

The whole crawl took 16.1 seconds, so for synchronous requests httpx is essentially as fast as requests.

Note: on Windows, installing httpx with pip may fail with an error asking for Microsoft Visual C++. Downloading and installing it resolves the problem.

Next comes the trump card: an asynchronous crawler written with httpx and asyncio, to see how long it really takes to scrape 580 records from Lianjia.

httpx (asynchronous) + parsel

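What makes httpx stand out is its ability to send asynchronous requests. The asynchronous crawler works as follows: first send a synchronous request to get the maximum page number, then turn the fetching and parsing of each page into an asyncio coroutine (defined with async def), and finally run all of them on the event loop.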

Most of the code is the same as in the synchronous crawler. There are two main changes:

# Async: a coroutine that parses a single page; takes the page URL
async def parse_single_page(self, url):
    # Send an asynchronous request with httpx to fetch the page
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=self.headers)
        selector = Selector(response.text)
    # the rest is unchanged

def parse_page(self):
    max_page = self.get_max_page()
    if max_page is None:
        return
    # Before Python 3.7, single tasks were created with asyncio.ensure_future
    # or loop.create_task; from Python 3.7 on, asyncio.create_task can be used.
    tasks = []
    for i in range(1, max_page + 1):
        url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
        tasks.append(self.parse_single_page(url))

    async def run_all():
        # asyncio.gather schedules all coroutines on the event loop and waits
        # for them; asyncio.wait no longer accepts bare coroutines on Python 3.11+
        await asyncio.gather(*tasks)

    # asyncio.run creates the event loop, runs run_all to completion, and closes the loop
    asyncio.run(run_all())

The complete code is shown below:

import asyncio
import csv
import json
import re
import time

import httpx
from fake_useragent import UserAgent
from parsel import Selector


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {'User-Agent': self.ua.random}
        self.data = list()
        self.path = '浦東_三房_500_800萬.csv'
        self.url = 'https://sh.lianjia.com/ershoufang/pudong/a3p5/'

    def get_max_page(self):
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # Create a Selector instance
            selector = Selector(response.text)
            # Use a CSS selector to find the pager div that holds the max page number
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # page-data holds a JSON string; json.loads turns it into a dict
            # (safer than eval). './@page-data' reads the attribute of this div only.
            max_page = json.loads(a[0].xpath('./@page-data').get())['totalPage']
            print('Max page number: {}'.format(max_page))
            return max_page
        else:
            print('Request failed, status: {}'.format(response.status_code))
            return None

    # Async: a coroutine that parses a single page; takes the page URL
    async def parse_single_page(self, url):
        async with httpx.AsyncClient() as client:
            response = await client.get(url, headers=self.headers)
            selector = Selector(response.text)
        ul = selector.css('ul.sellListContent')[0]
        li_list = ul.css('li')
        for li in li_list:
            detail = dict()
            detail['title'] = li.css('div.title a::text').get()

            # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
            house_info = li.css('div.houseInfo::text').get()
            house_info_list = house_info.split(' | ')
            detail['bedroom'] = house_info_list[0]
            detail['area'] = house_info_list[1]
            detail['direction'] = house_info_list[2]

            # re.search matches anywhere in the string
            floor_pattern = re.compile(r'\d{1,2}')
            match1 = re.search(floor_pattern, house_info_list[4])
            if match1:
                detail['floor'] = match1.group()
            else:
                detail['floor'] = '未知'

            # Match the construction year
            year_pattern = re.compile(r'\d{4}')
            match2 = re.search(year_pattern, house_info_list[5])
            if match2:
                detail['year'] = match2.group()
            else:
                detail['year'] = '未知'

            # e.g. 文蘭小區 - 塘橋: extract the community name and the area
            position_info = li.css('div.positionInfo a::text').getall()
            detail['house'] = position_info[0]
            detail['location'] = position_info[1]

            # e.g. 650萬: match 650
            price_pattern = re.compile(r'\d+')
            total_price = li.css('div.totalPrice span::text').get()
            detail['total_price'] = re.search(price_pattern, total_price).group()

            # e.g. 單價64182元/平米: match 64182
            unit_price = li.css('div.unitPrice span::text').get()
            detail['unit_price'] = re.search(price_pattern, unit_price).group()
            self.data.append(detail)

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            return
        # Before Python 3.7, single tasks were created with asyncio.ensure_future
        # or loop.create_task; from Python 3.7 on, asyncio.create_task can be used.
        tasks = []
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            tasks.append(self.parse_single_page(url))

        async def run_all():
            # asyncio.gather schedules all coroutines on the event loop and waits
            # for them; asyncio.wait no longer accepts bare coroutines on Python 3.11+
            await asyncio.gather(*tasks)

        # asyncio.run creates the event loop, runs run_all to completion, and closes the loop
        asyncio.run(run_all())

    def write_csv_file(self):
        head = ['標題', '小區', '房廳', '面積', '朝向', '樓層', '年份', '位置', '總價(萬)', '單價(元/平方米)']
        keys = ['title', 'house', 'bedroom', 'area', 'direction', 'floor', 'year', 'location', 'total_price', 'unit_price']
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print('Write a CSV file to path %s Successful.' % self.path)
        except Exception as e:
            print('Fail to write CSV to path: %s, Case: %s' % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print('Elapsed: {} seconds'.format(end - start))

Now for the moment of truth. Scraping the same 580 records from Lianjia, the asynchronous crawler written with httpx took only 2.5 seconds!
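The speed-up comes from overlapping the time spent waiting on requests, not from faster parsing. The effect can be reproduced without any network access. In the sketch below, asyncio.sleep stands in for network latency; the 0.1-second delay, the ten pages, and the function names are all made up for illustration and say nothing about Lianjia's real response times.

```python
import asyncio
import time


async def fetch_page(page, delay=0.1):
    """Pretend to download one page; the sleep simulates waiting on the network."""
    await asyncio.sleep(delay)
    return page


async def crawl_sequentially(pages):
    # Each request starts only after the previous one has finished
    return [await fetch_page(page) for page in pages]


async def crawl_concurrently(pages):
    # All requests are in flight at once; gather returns results in input order
    return await asyncio.gather(*(fetch_page(page) for page in pages))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


pages = list(range(1, 11))
seq_result, seq_time = timed(crawl_sequentially(pages))
con_result, con_time = timed(crawl_concurrently(pages))
print('sequential: {:.2f} s, concurrent: {:.2f} s'.format(seq_time, con_time))
```

With ten simulated pages the sequential version waits for the delays one after another, while the concurrent version waits for roughly a single delay, which mirrors the gap between the synchronous and asynchronous crawlers.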

Comparison and Summary

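Scraping the same content takes different amounts of time with different tool combinations. The httpx async + parsel combination is without question the biggest winner; requests and BeautifulSoup can now retire with honor.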

requests + BeautifulSoup: 18.5 seconds
requests + parsel: 16.5 seconds
httpx sync + parsel: 16.1 seconds
httpx async + parsel: 2.5 seconds
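To put these timings on a common scale, the short calculation below expresses each one as a speed-up over the requests + BeautifulSoup baseline. The numbers are the ones measured above; only the arithmetic is new.

```python
# Measured run times in seconds, as reported in this article
timings = {
    'requests + BeautifulSoup': 18.5,
    'requests + parsel': 16.5,
    'httpx sync + parsel': 16.1,
    'httpx async + parsel': 2.5,
}

baseline = timings['requests + BeautifulSoup']
# Speed-up factor = baseline time / time of the combination
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print('{:<26} {:>5.1f} s  {:.2f}x'.format(name, timings[name], factor))
```

The asynchronous crawler comes out about 7.4 times faster than the baseline, while swapping only the parser or only the synchronous client changes the total by roughly 10 to 15 percent.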

Are there other libraries you like to use for Python crawling?
