Preface: to help readers who are not comfortable with English, parts of this text have been translated.
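The requests-html library aims to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
When using this library you automatically get:
- Full JavaScript support!
- CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection-pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
Tutorial & Usage
Make a GET request to 'python.org', using Requests: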
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
Grab a list of all links on the page, as-is (anchors excluded):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps
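To make "as-is, anchors excluded" concrete, here is a minimal sketch that uses only the standard library. It is not how requests-html is implemented. The `LinkCollector` class and the sample markup are hypothetical, and skipping `href` values that start with `#` is an assumption about what "anchors excluded" means here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values exactly as written."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Assumption: in-page anchors such as '#top' are skipped.
        if href and not href.startswith('#'):
            self.links.add(href)


# Hypothetical sample markup, not the real python.org page.
sample = """
<a href="/about/apps/">Applications</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="#top">Back to top</a>
"""

collector = LinkCollector()
collector.feed(sample)
links = collector.links
print(sorted(links))
```

The relative and protocol-relative forms are kept untouched, which is why the set can contain entries that are not complete URLs.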
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', ...}
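The step from raw links to absolute links can be illustrated with `urllib.parse.urljoin` from the standard library. This is only a sketch of the idea, not the library's own code. The `base_url` value is an assumption (the page URL after redirects), and `raw_links` is a hand-picked sample.

```python
from urllib.parse import urljoin

# Assumption: the final page URL after any redirects.
base_url = 'https://www.python.org/'

# Hand-picked sample of links in the three forms seen on the page.
raw_links = [
    '/about/apps/',                                   # root-relative path
    '//docs.python.org/3/tutorial/',                  # protocol-relative URL
    'https://github.com/python/pythondotorg/issues',  # already absolute
]

# Resolve every link against the base URL.
absolute = {urljoin(base_url, link) for link in raw_links}
for url in sorted(absolute):
    print(url)
```

Root-relative paths inherit the scheme and host of the base URL, protocol-relative URLs inherit only the scheme, and absolute URLs pass through unchanged.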
Select an element with a CSS Selector:
>>> about = r.html.find('#about', first=True)
Grab an element's text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Render out an Element's HTML:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1" id="about">\n<a class="" href="/about/
Select elements within an element:
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
Search for links within an element:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
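The `{}` in the search template acts as a placeholder for the text you want to capture. As a rough illustration of that idea only, the sketch below turns such a template into a regular expression. The `template_search` helper is hypothetical and handles just anonymous `{}` fields (not named ones like `{months}`). The sample sentence is made up, and the library's real matching rules may differ.

```python
import re


def template_search(template, text):
    """Hypothetical helper: find the first match of a '{}' template in text."""
    # Escape the literal pieces so they are matched verbatim.
    literal_parts = [re.escape(part) for part in template.split('{}')]
    # Each '{}' becomes a non-greedy capture group.
    pattern = '(.+?)'.join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Made-up sample text standing in for the page contents.
page_text = 'Python is a programming language that lets you work quickly.'
result = template_search('Python is a {} language', page_text)
print(result[0])
```

Indexing the result with `[0]` mirrors the call above: it picks the first captured field.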
More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:
>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
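For readers new to XPath, the path reads as "the `a` elements inside the first `div` under `body`". The sketch below shows the same kind of path with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. It is not what requests-html does internally. The sample document is made up, and the path is written relative to the root element (`./body/...`) because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document.
doc = """<html><body>
<div><a href="#start-of-content">Skip to content</a></div>
<div><a href="/other">Other</a></div>
</body></html>"""

root = ET.fromstring(doc)  # root is the <html> element

# div[1] selects the first div child; XPath positions start at 1.
matches = root.findall('./body/div[1]/a')
hrefs = [a.get('href') for a in matches]
print(hrefs)
```

Only the link in the first `div` is returned; the one in the second `div` is left out.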
JavaScript Support
Let's grab some text that's rendered by JavaScript:
>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
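Note that the first time you run the render() method, it downloads Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.
Using without Requests
You can also use this library without Requests: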
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}