I've been learning Python web scraping recently and wrote a simple recursive crawler that downloads photos of pretty girls. Without further ado, here is a screenshot of the result:
Over three thousand photos in total :)
The Python version is 3.5. The program uses urllib.request to fetch pages and BeautifulSoup to parse the returned HTML, extracting both the image links on the current page and the links to further pages. After downloading the images, it visits each new link in turn, recursing until it reaches the maximum depth. A set named pages records the pages already crawled, so the same page is never visited twice. The source code follows:
import urllib.request
import re
import time
import os
from threading import Semaphore
from bs4 import BeautifulSoup

screenLock = Semaphore(value=1)  # guards print output (only useful if extended to multiple threads)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
main_url = 'http://www.chunmm.com'
num = 1
pages = set()  # pages already crawled, so the same page is never visited twice
pages.add(main_url)

def downloadimg(url, depth):
    if depth != 0:
        print(depth)
        print(url)
        req = urllib.request.Request(url, headers=headers)
        html = urllib.request.urlopen(req).read().decode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        # image links on the current page, and links leading to further pages
        imgurllist = soup.find_all('img', {'src': re.compile(r'http://.+?\.jpg')})
        urllist = soup.find_all('a', {'href': re.compile(r'/.+?/.+?\.html')})
        local_path = 'd:/OOXXimg/'
        os.makedirs(local_path, exist_ok=True)  # make sure the target folder exists
        global num
        for item in imgurllist:
            print(item["src"])
            imgurl = item["src"]
            path = local_path + str(num) + '.jpg'
            urllib.request.urlretrieve(imgurl, path)
            screenLock.acquire()
            print(str(num) + ' img was downloaded\n')
            screenLock.release()
            num += 1
        for link in urllist:
            newurl = main_url + link["href"]
            if newurl not in pages:
                pages.add(newurl)  # mark as visited before recursing
                downloadimg(newurl, depth - 1)
                time.sleep(1)  # be polite: pause between page requests
    else:
        return

def main():
    downloadimg(main_url, 3)

if __name__ == '__main__':
    main()
Note: it is best to wrap the page requests in exception handling, so that a bad URL does not crash the whole program. This example recurses 3 levels deep and downloaded 3000+ images in total.
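For example, a minimal sketch of that exception handling (the skip-and-continue policy here is my assumption, not part of the original program) could replace the two fetch lines inside downloadimg:

    # requires "import urllib.error" at the top of the file
    try:
        req = urllib.request.Request(url, headers=headers)
        html = urllib.request.urlopen(req).read().decode('utf-8')
    except (urllib.error.URLError, UnicodeDecodeError) as e:
        # skip pages that fail to load or decode instead of crashing
        print('skip %s: %s' % (url, e))
        return

Catching URLError also covers HTTPError, since the latter is its subclass.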
Thanks for reading!