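Preface
BeautifulSoup is a powerful library, but it is not the only popular parser. lxml, for example, supports XPath syntax and is also a fast way to parse HTML. If BeautifulSoup doesn't feel natural to you, give XPath a try. (Highly recommended!)
Installing lxml: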
pip install lxml
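Before the full script, here is a small taste of what XPath queries look like in lxml. This is only a sketch on a made-up HTML fragment: the markup, the class names and the film names are assumptions for illustration, not the real page.

```python
from lxml import etree

# Hypothetical HTML fragment, loosely imitating a film listing page.
SAMPLE_HTML = """
<div class="movies-list">
  <dl class="movie-list">
    <dd>
      <div class="movie-item-title"><a href="/films/1">Film A</a></div>
    </dd>
    <dd>
      <div class="movie-item-title"><a href="/films/2">Film B</a></div>
    </dd>
  </dl>
</div>
"""

selector = etree.HTML(SAMPLE_HTML)

# '//' searches at any depth; [@class="..."] filters on an attribute value.
links = selector.xpath('//div[@class="movie-item-title"]/a')

# Relative to an element, '@href' selects an attribute and 'text()' selects text nodes.
names = [a.xpath('text()')[0] for a in links]
hrefs = [a.xpath('@href')[0] for a in links]

print(names)
print(hrefs)
```

Every `xpath()` call returns a list, which is why the results are indexed with `[0]`.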
Code
# -*- coding: UTF-8 -*-
import os

import requests
from lxml import etree
# requests sends the HTTP requests; lxml parses the returned HTML
# Python 3: str is Unicode, so no default-encoding workaround is needed for Chinese text

ori_url = 'http://maoyan.com/films?sortId=1&offset={}'
# Maoyan film listing URL; offset starts at 0 and grows by 30 (30 films per page)
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Cookie': 'your cookie',
    # fill in the cookie from your own browser
    'Host': 'maoyan.com',
    'Referer': 'http://maoyan.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
all_url = []
for i in range(11):
    offset = str(i * 30)
    req_url = ori_url.format(offset)
    all_url.append(req_url)
# 11 pages in total; only the offset in the URL changes
movie_item = list()
i = 0  # index of the next film whose name and URL will be stored
j = 0  # index of the next film whose score will be stored
for url in all_url:
    html = requests.get(url, headers=headers).text
    selector = etree.HTML(html)
    infos = selector.xpath('//div[@class="movies-list"]/dl[@class="movie-list"]//div[@class="channel-detail movie-item-title"]/a')
    # XPath for each film's name and URL
    j = i  # scores on this page start at the first film of this page
    for info in infos:
        movie_item.append(dict())
        movie_url = 'http://maoyan.com' + info.xpath('@href')[0]
        movie_name = info.xpath('text()')[0]
        movie_item[i]['name'] = movie_name
        movie_item[i]['url'] = movie_url
        i += 1
    score = selector.xpath('//div[@class="channel-detail channel-detail-orange"]')
    # XPath for each film's score (two cases: a numeric score, or a "no score yet" label)
    for item in score:
        if item.text is None:
            # numeric score: the value is split across two child elements
            sc = item[0].text + item[1].text
        else:
            # no score yet: the label is the element's own text
            sc = item.text
        movie_item[j]['score'] = sc
        j += 1
movie_item = sorted(movie_item, key=lambda item: item['score'], reverse=False)
# sort by score (ascending; the scores are compared as strings)
os.makedirs('./p_data', exist_ok=True)
# write the results to a local file
with open('./p_data/movieinfos.txt', 'w', encoding='utf-8') as f:
    print(len(movie_item))
    for item in movie_item:
        f.write('{} {} {}\n'.format(item['name'], item['score'], item['url']))
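One thing to watch in the script above: names and scores are collected in two separate passes and matched up by position, so it only works if every page yields the same number of title nodes and score nodes. A possible alternative is to walk each film's container and query relative to it. The sketch below runs on a made-up fragment. The class names on the title and score divs come from the script above, but the `dd` wrapper, the `i` child elements and the helper name `parse_films` are my assumptions, not confirmed markup of maoyan.com.

```python
from lxml import etree

# Made-up fragment; the <dd> wrapper and the <i> children are assumptions.
SAMPLE_HTML = """
<div class="movies-list"><dl class="movie-list">
  <dd>
    <div class="channel-detail movie-item-title"><a href="/films/1">Film A</a></div>
    <div class="channel-detail channel-detail-orange"><i class="integer">9.</i><i class="fraction">1</i></div>
  </dd>
  <dd>
    <div class="channel-detail movie-item-title"><a href="/films/2">Film B</a></div>
    <div class="channel-detail channel-detail-orange">No score yet</div>
  </dd>
</dl></div>
"""


def parse_films(html, base_url='http://maoyan.com'):
    """Return one dict per film, reading name, URL and score from the same container."""
    selector = etree.HTML(html)
    films = []
    for dd in selector.xpath('//dl[@class="movie-list"]/dd'):
        # A leading './/' keeps the query inside this container.
        link = dd.xpath('.//div[@class="channel-detail movie-item-title"]/a')
        if not link:
            continue  # skip containers without a title link
        score_nodes = dd.xpath('.//div[@class="channel-detail channel-detail-orange"]')
        # string(.) joins all text inside the node, so it covers both score layouts.
        score = score_nodes[0].xpath('string(.)').strip() if score_nodes else ''
        films.append({
            'name': link[0].xpath('string(.)').strip(),
            'url': base_url + link[0].get('href'),
            'score': score,
        })
    return films


films = parse_films(SAMPLE_HTML)
for film in films:
    print(film)
```

Because each dict is built from a single container, a film with a missing score gets an empty string instead of shifting every score after it.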
Final result
[Image: image.png, screenshot of the final result]
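A note on the ordering: the script sorts on the score strings, so the comparison is character by character and the "no score yet" entries are mixed in with the numbers. If you want a numeric ranking, something like the sketch below might help. It assumes scores look like '9.1' and treats anything that does not parse as a number as unrated; the helper name `score_key` and the sample data are hypothetical.

```python
def score_key(item):
    """Sort key: rated films by numeric score, unrated films last."""
    try:
        return (0, float(item['score']))
    except (KeyError, TypeError, ValueError):
        return (1, 0.0)


# Hypothetical sample records in the same shape the script builds.
sample = [
    {'name': 'Film A', 'score': '9.1'},
    {'name': 'Film B', 'score': 'No score yet'},
    {'name': 'Film C', 'score': '10.0'},
    {'name': 'Film D', 'score': '8.4'},
]

ranked = sorted(sample, key=score_key)
print([film['name'] for film in ranked])
```

With a plain string sort, '10.0' would land before '8.4'; the tuple key keeps the numbers in numeric order and pushes the unrated films to the end.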
P.S. Blade Runner (銀翼殺手) and Alien: Covenant (異形契約) are both very good films in my opinion, and the directing is first-rate (mixed feelings).