When learning Python, web scraping is an easy way to get started and arguably a rite of passage. This project uses the Scrapy framework to crawl the Douban Top 250 movies and save the cover images along with other information. The Douban Top 250 list requires no login, involves no JavaScript rendering, and spans only 10 pages, which makes it an ideal practice target.
My development environment
- Windows 10, 64-bit
- Python 3.6.1
- PyCharm, Sublime Text
- MySQL and MongoDB; visualization tools: DbVisualizer, Robomongo
Project structure
- spiders/sp_douban.py: handles the links and parses the item fields
- items.py: the field definitions for the Douban Top 250 movies
- middlewares.py, user_agents.py: pick a random User-Agent from a maintained pool (see the sketch after this list)
- settings.py: the configuration file
- main.py: saves typing the run command on the command line every time (also sketched below)
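The User-Agent middleware and main.py are small enough to sketch here. The agents list, the middleware class name, and the spider name "douban" are assumptions rather than the project's exact code:

# user_agents.py / middlewares.py — a random User-Agent downloader middleware
import random

agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # swap in a random UA before each request goes out
        request.headers['User-Agent'] = random.choice(agents)

# main.py — run the spider from the IDE instead of the command line
from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())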
Analyzing the page content to scrape
Entry URL: https://movie.douban.com/top250
(Screenshots in the original post: the list's content area, and the span elements that hold each field)
As shown in those screenshots, the scraped fields map to the item definition below:
class DoubanTopMoviesItem(scrapy.Item):
    title_ch = scrapy.Field()      # Chinese title
    title_en = scrapy.Field()      # English title
    title_ht = scrapy.Field()      # Hong Kong/Taiwan title
    detail = scrapy.Field()        # director, cast and other details
    rating_num = scrapy.Field()    # rating score
    rating_count = scrapy.Field()  # number of ratings
    quote = scrapy.Field()         # one-line blurb
    image_urls = scrapy.Field()    # cover image URLs
    topid = scrapy.Field()         # rank number
Pull each field out with XPath and do the necessary cleanup, stripping whitespace and other noise:
# Inside parse(self, response); `import re` is needed at the top of the file.
item['title_ch'] = response.xpath('//div[@class="hd"]//span[@class="title"][1]/text()').extract()
en_list = response.xpath('//div[@class="hd"]//span[@class="title"][2]/text()').extract()
item['title_en'] = [en.replace('\xa0/\xa0', '').replace(' ', '') for en in en_list]
ht_list = response.xpath('//div[@class="hd"]//span[@class="other"]/text()').extract()
item['title_ht'] = [ht.replace('\xa0/\xa0', '').replace(' ', '') for ht in ht_list]
detail_list = response.xpath('//div[@class="bd"]/p[1]/text()').extract()
item['detail'] = [detail.replace(' ', '').replace('\xa0', '').replace('\n', '') for detail in detail_list]
# Note: some movies have no quote, so this page-wide list can come back
# shorter than the others and misaligned (see the per-movie sketch below)
item['quote'] = response.xpath('//span[@class="inq"]/text()').extract()
item['rating_num'] = response.xpath('//div[@class="star"]/span[2]/text()').extract()
# The count has the form "XXX人評價" ("rated by XXX people"); the regex pulls out just the number
count_list = response.xpath('//div[@class="star"]/span[4]/text()').extract()
item['rating_count'] = [re.findall(r'\d+', count)[0] for count in count_list]
item['image_urls'] = response.xpath('//div[@class="pic"]/a/img/@src').extract()
item['topid'] = response.xpath('//div[@class="pic"]/em/text()').extract()
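One way to sidestep the missing-quote misalignment is to iterate movie by movie instead of extracting page-wide lists. A sketch, assuming each movie sits in an li under ol class="grid_view" (true of the page at the time of writing); this is not the project's original code:

for movie in response.xpath('//ol[@class="grid_view"]/li'):
    # extract_first() returns a default instead of shifting the list
    # when a field such as quote is missing for this movie
    quote = movie.xpath('.//span[@class="inq"]/text()').extract_first(default='')
    topid = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()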
Three ways to crawl the links
The second page's URL is https://movie.douban.com/top250?start=25&filter= . Each page holds 25 movies, so paging is just a matter of increasing start by 25 each time.
- Override the start_requests method
base_url = "https://movie.douban.com/top250"

# There are 10 pages with a fixed URL format. Overriding start_requests
# is equivalent to listing start_urls and following pagination links.
def start_requests(self):
    for i in range(0, 226, 25):
        url = self.base_url + "?start=%d&filter=" % i
        yield scrapy.Request(url, callback=self.parse)
- Start from start_urls and follow each page's "next" link
base_url = "https://movie.douban.com/top250"
start_urls = [base_url]

# At the end of parse(): pull the next-page link and keep crawling
new_url = response.xpath('//link[@rel="next"]/@href').extract_first()
if new_url:
    next_url = self.base_url + new_url
    yield scrapy.Request(next_url, callback=self.parse)
- Start from start_urls and use a LinkExtractor rule
# This approach needs bigger changes: extra imports, subclassing
# CrawlSpider, and the callback must not be named parse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

base_url = "https://movie.douban.com/top250"
start_urls = [base_url]
rules = [Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*',)),
              callback='parse_item', follow=True)]
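For context, here is a minimal sketch of how those pieces fit into a complete CrawlSpider. The class name DoubanCrawlSpider and the body of parse_item are illustrative assumptions, not the project's code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DoubanCrawlSpider(CrawlSpider):
    name = 'douban_crawl'
    start_urls = ['https://movie.douban.com/top250']
    rules = [Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*',)),
                  callback='parse_item', follow=True)]

    def parse_item(self, response):
        # item extraction goes here; naming this method parse would
        # override CrawlSpider.parse and break rule-based crawling
        pass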
Downloading and saving the content
Drawing on several other tutorials, this project bundles multiple storage options: saving the movie covers to disk, writing to MySQL, and writing to MongoDB. ITEM_PIPELINES is configured in settings.py, so switching methods is just a matter of uncommenting the pipeline you want (sketched below).
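A sketch of what that configuration might look like. The module path firsttest.pipelines, the IMAGES_STORE value, and the priority numbers are assumptions inferred from the class names used below:

# settings.py
IMAGES_STORE = 'images'  # root folder for saved covers (value assumed)

ITEM_PIPELINES = {
    'firsttest.pipelines.FirsttestPipeline': 300,     # custom cover download
    # 'firsttest.pipelines.DoubanmoviePipeline': 301, # MySQL
    # 'firsttest.pipelines.MongoDBPipeline': 302,     # MongoDB
    # 'firsttest.pipelines.MyImagesPipeline': 303,    # built-in ImagesPipeline
}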
- Custom image download method

(Screenshot in the original post: the downloaded covers)

# Custom image download pipeline; needs `import os` and
# `import urllib.request`, plus the project's settings module
class FirsttestPipeline(object):
    # cover file name: rank plus movie title
    def _createmovieImageName(self, item):
        lengh = len(item['topid'])
        return [item['topid'][i] + "-" + item['title_ch'][i] + ".jpg" for i in range(lengh)]

    # alternative naming scheme: reuse the file name from the image URL
    # def _createImagenameByURL(self, image_url):
    #     file_name = image_url.split('/')[-1]
    #     return file_name

    def process_item(self, item, spider):
        namelist = self._createmovieImageName(item)
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for i in range(len(namelist)):
            image_url = item['image_urls'][i]
            file_name = namelist[i]
            file_path = '%s/%s' % (dir_path, file_name)
            if os.path.exists(file_path):
                print("Already downloaded, skipping: " + image_url)
                continue
            with open(file_path, 'wb') as file_writer:
                print("Downloading: " + image_url)
                conn = urllib.request.urlopen(image_url)
                file_writer.write(conn.read())
        return item
- Saving to a MySQL database

This assumes MySQL is already installed and configured, which is left to the reader. The project's table definition:
CREATE TABLE DOUBANTOPMOVIE (
    topid INT(3) PRIMARY KEY,
    title_ch VARCHAR(50),
    rating_num FLOAT(1),
    rating_count INT(9),
    quote VARCHAR(100),
    createdTime TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    updatedTime TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The pipeline implementation:
# Save items to a MySQL database; needs `import MySQLdb`,
# `import MySQLdb.cursors` and `from twisted.enterprise import adbapi`
class DoubanmoviePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset=settings['MYSQL_CHARSET'],
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        # ** expands the dict into keyword arguments
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)
        return cls(dbpool)

    # called by Scrapy for every item
    def process_item(self, item, spider):
        # run the insert on a thread from the connection pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        # attach the error handler
        query.addErrback(self._handle_error, item, spider)
        return item

    def _conditional_insert(self, tx, item):
        sql = "insert into doubantopmovie(topid,title_ch,rating_num,rating_count) values(%s,%s,%s,%s)"
        lengh = len(item['topid'])
        for i in range(lengh):
            params = (item["topid"][i], item["title_ch"][i],
                      item["rating_num"][i], item["rating_count"][i])
            tx.execute(sql, params)

    # addErrback passes the failure first, then the extra item/spider args
    def _handle_error(self, failure, item, spider):
        print(failure)
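The MYSQL_* keys read in from_settings live in settings.py; a sketch with placeholder values:

# settings.py — placeholder connection values, adjust to your setup
MYSQL_HOST = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_DBNAME = 'douban'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'your_password'
MYSQL_CHARSET = 'utf8'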
- Saving to a MongoDB database

This assumes MongoDB is already installed, which is left to the reader. Robomongo is a good visualization tool. The implementation (a screenshot of the saved documents is in the original post):
# Save items to a MongoDB database; needs `from pymongo import MongoClient`
# and `from scrapy.exceptions import DropItem`
class MongoDBPipeline(object):
    mongo_uri_no_auth = 'mongodb://localhost:27017/'  # no authentication
    database_name = 'yun'
    table_name = 'coll'
    client = MongoClient(mongo_uri_no_auth)  # connect to mongodb
    db = client[database_name]
    table = db[table_name]  # handle to the collection

    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not item[data]:  # check the field's value, not its name
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.table.insert(dict(item))
        return item
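Hard-coding the connection details works for a small demo; the pattern in the Scrapy docs reads them from settings via from_crawler instead. A sketch, where the setting names MONGO_URI and MONGO_DATABASE are assumptions:

import pymongo

class MongoDBSettingsPipeline(object):
    def __init__(self, mongo_uri, db_name):
        self.mongo_uri = mongo_uri
        self.db_name = db_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017/'),
            db_name=crawler.settings.get('MONGO_DATABASE', 'yun'),
        )

    def open_spider(self, spider):
        # open one connection per crawl instead of at class-definition time
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['coll'].insert_one(dict(item))
        return item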
- Downloading images with the built-in ImagesPipeline

Scrapy's built-in ImagesPipeline is also easy to set up. In my tests, though, it turned out slower than the custom method above, and I am not sure whether something in my code is at fault. If anyone spots the cause, please point it out.
# scrapy.contrib is deprecated; the current import path is scrapy.pipelines
from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

# Download images with Scrapy's built-in ImagesPipeline class
class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]
        return 'doubanmovie2/%s' % image_name

    # take each url from the item and hand a Request to the pipeline
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    # once the downloads finish, the results are passed in here
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # item['image_paths'] = image_paths  # optionally keep the paths
        return item
Miscellaneous
from scrapy.selector import Selector
Selector(response).xpath('//span/text()').extract()
# equivalent to:
response.selector.xpath('//span/text()').extract()  # .selector is an attribute of the response object
# and, shortened further, equivalent to:
response.xpath('//span/text()').extract()
The full project code is on GitHub.
If you found this helpful, please give it a star!