Getting Ready for Web Scraping
01 A crawler simply simulates a browser to fetch content. The crawler workflow has three steps: data crawling, data parsing, and data storage.
Data crawling: mobile or PC pages. Data parsing: regular expressions. Data storage: save to a file or to a database.
02. Related Python libraries
A crawler needs two library modules: requests and re.
1. The requests library
requests is a simple, easy-to-use HTTP library, much more concise than urllib. Since it is a third-party library it has to be installed; an installation tutorial is linked at the end of this article (all links are collected at the end for easy reference).
HTTP features supported by requests:
keep-alive and connection pooling, cookie-persistent sessions, multipart file uploads, chunked requests, and more.
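As a quick illustration of cookie-persistent sessions and keep-alive, here is a minimal sketch (httpbin.org is just a public test service used for demonstration, not a site from this article):
import requests

# A Session reuses the underlying TCP connection (keep-alive) and carries cookies across requests.
s = requests.Session()
s.get('https://httpbin.org/cookies/set/token/123')  # the server sets a cookie on this session
r = s.get('https://httpbin.org/cookies')             # the cookie is sent back automatically
print(r.json())                                      # expected: {'cookies': {'token': '123'}}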
The requests library offers many methods, but under the hood they all call request(), so strictly speaking requests has only the request() method; in practice, though, request() is rarely called directly. The seven main methods are introduced below:
① requests.request()
Constructs a request; it underpins all the other methods.
Form: requests.request(method, url, **kwargs)
method: the request method, e.g. GET, POST, PUT
url: the URL of the page to fetch
**kwargs: optional keyword arguments controlling the request
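For example, the following minimal sketch sends the same GET request via request() and via the get() shortcut (httpbin.org is just a stand-in test URL):
import requests

r1 = requests.request('GET', 'https://httpbin.org/get', params={'q': 'python'}, timeout=5)
r2 = requests.get('https://httpbin.org/get', params={'q': 'python'}, timeout=5)
print(r1.status_code, r1.url)  # 200 https://httpbin.org/get?q=python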
② requests.get()
The main method for fetching an HTML page, corresponding to HTTP GET. It builds a Request object asking the server for a resource and returns a Response object containing that resource.
Attributes of the Response object:
r.status_code — HTTP status of the request (200 on success; e.g. 404 on failure)
r.text — the response body as a string, i.e. the content of the page at the URL
r.encoding — the response encoding guessed from the HTTP headers
r.apparent_encoding — the encoding inferred from the content itself (a fallback)
r.content — the response body as raw bytes
Form: res = requests.get(url)
code = res.text (res.text gives the body as text; res.content gives raw bytes; res.json() parses it as JSON)
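A short sketch that puts these attributes together (example.com is only a placeholder target):
import requests

res = requests.get('https://example.com', timeout=5)
print(res.status_code)                 # 200 if the request succeeded
res.encoding = res.apparent_encoding   # fall back to the encoding detected from the body
print(res.text[:200])                  # first 200 characters of the page
print(len(res.content))                # size of the raw response bytes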
③ requests.head()
Fetches only the header information of an HTML page, corresponding to HTTP HEAD.
Form: res = requests.head(url)
④ requests.post()
Submits a POST request to a page, corresponding to HTTP POST.
Form: res = requests.post(url)
⑤ requests.put()
Submits a PUT request to a page, corresponding to HTTP PUT.
⑥ requests.patch()
Submits a partial-modification request to a page, corresponding to HTTP PATCH.
⑦ requests.delete()
Submits a delete request to a page, corresponding to HTTP DELETE.
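A compact sketch exercising methods ③–⑦ against httpbin.org (a public echo service used here purely for illustration):
import requests

print(requests.head('https://httpbin.org/get').headers.get('Content-Type'))  # headers only, no body
print(requests.post('https://httpbin.org/post', data={'name': 'test'}).status_code)
print(requests.put('https://httpbin.org/put', data={'name': 'test'}).status_code)
print(requests.patch('https://httpbin.org/patch', data={'name': 'new'}).status_code)
print(requests.delete('https://httpbin.org/delete').status_code)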
"""requests 操作練習(xí)"""
import requests
import re
#數(shù)據(jù)的爬取
h = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get('https://movie.douban.com/chart',headers=h)
html_str = response.text
#數(shù)據(jù)解析<a class="nbg" title="漢密爾頓">
pattern = re.compile('<a class="nbg".*?title="(.*?)">') # .*? 任意匹配盡可能多的匹配盡可能少的字符
result = re.findall(pattern,html_str)
print(result)
2. The re module: regular expressions (Regular Expression)
A regular expression is a special string made up of letters and symbols, used to pick out the pieces of text that match the format you are after.
About .*? :
* matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". It is equivalent to {0,}.
? makes the preceding quantifier non-greedy, matching as few characters as possible. For example, against the string "oooo", "o+?" matches a single "o", while "o+" matches all of them.
. matches any single character except "\n". To match any character including "\n", use a pattern such as "(.|\n)".
.* is greedy: it first matches as much as it can, then backtracks as needed so the rest of the pattern can still match.
.*? is the opposite: as soon as it has matched enough it moves on without backtracking, so it has the minimal-match property (it matches as few characters as possible while still letting the whole pattern match).
(.*) is a greedy capture, grabbing as many characters as possible between the surrounding literals.
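A quick self-contained illustration of greedy versus non-greedy matching:
import re

html = '<a class="nbg" title="A"></a><a class="nbg" title="B"></a>'
print(re.findall('title="(.*)"', html))   # greedy: ['A"></a><a class="nbg" title="B']
print(re.findall('title="(.*?)"', html))  # non-greedy: ['A', 'B']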
03. Parsing the page source with XPath
import requests
import re
from bs4 import BeautifulSoup
from lxml import etree
# Data crawling (with some HTTP header info)
h = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get('https://movie.XX.com/chart', headers=h)
html_str = response.text
# Data parsing
# Regex parsing
def re_parse(html_str):
    pattern = re.compile('<a class="nbg".*?title="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results
# bs4 parsing
def bs4_parse(html_str):
    soup = BeautifulSoup(html_str, 'lxml')
    items = soup.find_all(class_='nbg')
    for item in items:
        print(item.attrs['title'])
# lxml (XPath) parsing
def lxml_parse(html_str):
    html = etree.HTML(html_str)
    results = html.xpath('//a[@class="nbg"]/@title')
    print(results)
    return results
re_parse(html_str)
bs4_parse(html_str)
lxml_parse(html_str)
04. The architecture of a Python crawler
As the diagram shows, a basic crawler architecture breaks down into five components: the crawler scheduler, the URL manager, the HTML downloader, the HTML parser, and the data store.
Here is what each of the five components does (a skeleton sketch follows the list):
① Crawler scheduler: coordinates and calls the other four modules; scheduling here simply means invoking the other components.
② URL manager: manages the URL links, split into those already crawled and those not yet crawled, and exposes an interface for obtaining new URLs.
③ HTML downloader: downloads the HTML of the pages to be crawled.
④ HTML parser: extracts the target data from the HTML source, sends newly discovered URLs to the URL manager, and passes the processed data to the data store.
⑤ Data store: saves the data handed over by the HTML parser to local storage.
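A bare-bones sketch of how these five components might fit together (the class and method names below are illustrative assumptions, not taken from any particular framework):
import requests

class UrlManager:
    """URL manager: tracks URLs waiting to be crawled and URLs already crawled."""
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()
    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)
    def has_new(self):
        return bool(self.new_urls)
    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

class HtmlDownloader:
    """HTML downloader: fetches the raw HTML of a page."""
    def download(self, url):
        return requests.get(url, timeout=5).text

class Scheduler:
    """Crawler scheduler: coordinates the other components."""
    def __init__(self, parse, store):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parse = parse    # HTML parser: a callable returning (new_urls, data)
        self.store = store    # data store: a callable that saves the parsed data
    def crawl(self, seed_url):
        self.urls.add(seed_url)
        while self.urls.has_new():
            html = self.downloader.download(self.urls.get())
            new_urls, data = self.parse(html)
            for u in new_urls:
                self.urls.add(u)
            self.store(data)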
Crawling whois data
Every year, millions of individuals, companies, organizations, and government agencies register domain names. Each registrant has to provide identifying and contact information, including name, address, email, phone number, and administrative and technical contacts. This kind of information is commonly called whois data.
"""
whois
http://whois.chinaz.com/sina.com
"""
import requests
import re
h = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get('http://whois.chinaz.com/'+input("請輸入網(wǎng)址:"),headers=h)
print(response.status_code)
html = response.text
#print(html)
#解析數(shù)據(jù)
pattern = re.compile('class="MoreInfo".*?>(.*?)</p>',re.S)
result = re.findall(pattern,html)
# 方法一:
# str = re.sub('\n',',',result[0])
# print(str)
#方法二:
print(result[0].replace('/n',','))
Crawling movie information
"""爬取*眼電影前100電影信息"""
import requests
import re
import time
# count = [0,10,20,30,40,50,60,70,80,90]
h = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
responce = requests.get('https://XX.com/board/4?offset=0', headers=h)
responce.encoding = 'utf-8'
html = responce.text
# 解析數(shù)據(jù) time.sleep(2)
patter = re.compile('class="name">.*?title="(.*?)".*?主演:(.*?)</p>.*?上映時(shí)間:(.*?)</p>', re.S)
#time.sleep(2)
result = re.findall(patter, html)
print(result)
with open('maoyan.txt', 'a', encoding='utf-8') as f:
for item in result: # 讀取result(以元組的形式儲(chǔ)存)中的內(nèi)容=》
for i in item:
f.write(i.strip().replace('\n', ','))
#print('\n')
Crawling images
"""*精靈爬取練習(xí) http://616pic.com/png/ ==》 http://XX.616pic.com/ys_img/00/06/20/64dXxVfv6k.jpg"""
import requests
import re
import time
#數(shù)據(jù)的爬取img的url
def get_urls():
response = requests.get('http://XX.com/png/')
html_str = response.text
#解析數(shù)據(jù),得到url
pattern = re.compile('<img class="lazy" data-original="(.*?)"')
results = re.findall(pattern,html_str)
print(results)
return results
#<img class="lazy" data-original="http://XX.616pic.com/ys_img/00/06/20/64dXxVfv6k.jpg">
#下載圖片
def down_load_img(urls):
for url in urls:
response = requests.get(url)
with open('temp/'+url.split('/')[-1], 'wb') as f:
f.write(response.content)
print(url.split('/')[-1],'已經(jīng)下載成功')
if __name__ == '__main__':
urls = get_urls()
Crawling beauty photos
'''Crawling Toutiao beauty photos ==== Method 1'''
import requests
import re
url = 'https://www.XX.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E7%BE%8E%E5%A5%B3&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1596180364628&_signature=-Bv0rgAgEBA-TE0juRclmfgatbAAKdC7s6ktYqc7u9jLqXOQ5SBCDkd25scxRvDydd6TgtOw0B7RVuaQxhwY1BwV89sPbdam8LkNuV08d0QfrZqQ4oOOrOukEJ1qxroigLT'
response = requests.get(url)
print(response.status_code)
html_str = response.text
# Parse "large_image_url":"(.*?)"
pattern = re.compile('"large_image_url":"(.*?)"')
urls = re.findall(pattern, html_str)
print(urls)
def down_load(urls):
    for url in urls:
        response = requests.get(url)
        with open('pic/' + url.split('/')[-1], 'wb') as f:
            f.write(response.content)
            print(url.split('/')[-1], 'downloaded successfully')
if __name__ == '__main__':
    down_load(urls)
'''Crawling Toutiao beauty photos ==== Method 2'''
import requests
import re
from urllib.parse import urlencode
# https://www.XX.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E7%BE%8E%E5%A5%B3&autoload=true&count=20
def get_urls(page):
    keys = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': 20 * page,
        'keyword': '美女',  # the search keyword ("beauty")
        'count': '20'
    }
    keys_word = urlencode(keys)
    url = 'https://www.XX.com/api/search/content/?' + keys_word
    response = requests.get(url)
    print(response.status_code)
    html_str = response.text
    # Parse "large_image_url":"(.*?)"
    pattern = re.compile('"large_image_url":"(.*?)"', re.S)
    urls = re.findall(pattern, html_str)
    return urls
# Download the images
def download_imags(urls):
    for url in urls:
        response = requests.get(url)
        with open('pic/' + url.split('/')[-1] + '.jpg', 'wb') as f:
            f.write(response.content)
            print(url.split('/')[-1] + '.jpg', 'downloaded')
if __name__ == '__main__':
    for page in range(3):
        urls = get_urls(page)
        print(urls)
        download_imags(urls)
5 Thread pools
A thread pool is a form of multithreading in which tasks are added to a queue and then started automatically as threads become available. Thread-pool threads are background threads; each runs with the default stack size and default priority, in a multithreaded unit.
"""線程池"""from concurrent.futures import ThreadPoolExecutor
import time
import threadingdef ban_zhuang(i):
print(threading.current_thread().name,"**開始搬磚{}**".format(i))
time.sleep(2)
print("**員工{}搬磚完成**一共搬磚:{}".format(i,12**2)) #將format里的內(nèi)容輸出到{}if __name__ == '__main__': #主線程
start_time = time.time()
print(threading.current_thread().name,"開始搬磚")
with ThreadPoolExecutor(max_workers=5) as pool:
for i in range(10):
p = pool.submit(ban_zhuang,i)
end_time =time.time()
print("一共搬磚{}秒".format(end_time-start_time))
A crawler combined with multithreading:
'''Crawling Toutiao beauty photos, multithreaded'''
import requests
import re
from urllib.parse import urlencode
import time
import threading
# https://www.XX.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E7%BE%8E%E5%A5%B3&autoload=true&count=20
def get_urls(page):
    keys = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': 20 * page,
        'keyword': '美女',  # the search keyword ("beauty")
        'count': '20'
    }
    keys_word = urlencode(keys)
    url = 'https://www.XX.com/api/search/content/?' + keys_word
    response = requests.get(url)
    print(response.status_code)
    html_str = response.text
    # Parse "large_image_url":"(.*?)"
    pattern = re.compile('"large_image_url":"(.*?)"', re.S)
    urls = re.findall(pattern, html_str)
    return urls
# Download the images
def download_imags(urls):
    for url in urls:
        try:
            response = requests.get(url)
            with open('pic/' + url.split('/')[-1] + '.jpg', 'wb') as f:
                f.write(response.content)
                print(url.split('/')[-1] + '.jpg', 'downloaded')
        except Exception as err:
            print('An exception happened:', err)
if __name__ == '__main__':
    start = time.time()
    thread = []
    for page in range(3):
        urls = get_urls(page)
        # print(urls)
        # Multithreading: one thread per image URL
        for url in urls:
            th = threading.Thread(target=download_imags, args=([url],))  # wrap the single URL in a list, since download_imags expects an iterable of URLs
            # download_imags(urls)
            thread.append(th)
    for t in thread:
        t.start()
    for t in thread:
        t.join()
    end = time.time()
    print('Elapsed:', end - start)
6 Tips: the robots protocol
The Robots protocol, also called the crawler protocol or robots protocol and formally known as the Robots Exclusion Protocol, tells crawlers and search engines which pages may and may not be fetched. It usually takes the form of a robots.txt text file placed in the root directory of the website.
Robots protocol location: site root + /robots.txt, e.g. www.baidu.com/robots.txt
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
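Before crawling you can also check robots.txt in code with the standard library's robotparser; a minimal sketch:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.baidu.com/robots.txt')
rp.read()  # download and parse the robots.txt file
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/baidu'))  # False, given the Disallow: /baidu rule shown above
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/duty/'))  # True if no rule disallows this path for Baiduspider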
Tip: do respect the robots protocol. These scripts are only for practice crawling, and remember to go through a proxy. (I have altered all the links in this article; if you want to practice, message me or find your own targets. It really is good fun.)
7 Related links
Installing and using requests: http://www.lxweimin.com/p/140012f88f8e
A guide to re: https://www.cnblogs.com/vmask/p/6361858.html
Other crawler-related articles: https://blog.csdn.net/qq_27297393/article/details/81630774
A video course on crawlers: https://www.imooc.com/learn/563