Please credit when reposting: 陳熹 chenx6542@foxmail.com (Jianshu: 半為花間酒)
For reposting on a WeChat official account, please contact the account 早起Python first.
Scrapy is a crawler framework implemented in pure Python; its main strengths are simplicity, ease of use, and high extensibility.
This article won't dwell on Scrapy's basics. It focuses on that extensibility and walks through how to configure each of the major components.
Not exhaustively either, to be fair, but it should cover most people's needs : )
Of course, you can always read the official documentation for more details.
Scrapy official documentation
Picture Scrapy's data-flow diagram here for review and reference
(though that diagram is incomplete; it doesn't even show the spider middlewares..)
Now to the main topic. The concrete examples use a spider for a certain douban site.
Creation commands
scrapy startproject <Project_name>
scrapy genspider <spider_name> <domains>
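For this post's project, the two commands would look like this (the names Douban and douban match what the rest of the article uses):

scrapy startproject Douban
cd Douban
scrapy genspider douban douban.com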
If instead you want the convenient CrawlSpider skeleton for site-wide crawling, use:
scrapy genspider -t crawl <spider_name> <domains>
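For reference, the spider that this template generates looks roughly like the following (the class name and attribute values depend on what you pass in):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # each Rule declares which links to extract and how to handle them
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item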
spider.py
Let's start with spider.py, the most central component (arguably).
import scrapy
# anyone with some Python background will recognize these; no lengthy introduction
import json
# the item class is imported for persistence; you can also spell the import
# out from the folder names
from ..items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # set request headers for this one spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }
    }

    # much of the time this method doesn't need overriding; override it when you
    # want customized start URLs or per-request headers
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'  # elided; the real URL carries a {} placeholder for the offset
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # add headers to one particular request; later requests are set the same way
            # req.headers['User-Agent'] = ''
            yield req

    # nothing special to explain: routine page parsing that hands its output
    # to ... (clear once you look at the data flow)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # Scrapy's response can be parsed with xpath directly; the basics need no elaboration
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()
        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # downloading the images needs the separate ImagesPipeline, with matching
        # entries in both settings.py and pipelines.py
        item['image_urls'] = [image]
        yield item
For a site-wide crawl, the opening of the spider class differs slightly:

rules = (
    Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'),
         callback='parse_item', follow=False),
)

The key is the follow setting; whether the crawl reaches the intended depth and pages is something you have to judge for yourself.
One more note: request headers can be set in three places, and where you set them decides their scope:
- in settings.py: the widest scope, affecting every spider in the project
- as a spider class attribute (custom_settings): affecting all requests of that spider
- on a specific request: affecting only that request
In other words, the three scopes run from global, to a single spider, to a single request. Where they coexist, the per-request headers take the highest priority, as the short sketch below shows.
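As a quick sketch of the narrowest scope: besides mutating req.headers as the spider above does, you can also pass headers directly to the Request constructor (the URL here is just a placeholder):

yield scrapy.Request(
    url='https://example.com',
    headers={'User-Agent': 'my-custom-agent'},  # applies to this request only
    callback=self.parse,
)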
items.py
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # the image-downloading ImagesPipeline needs its own item field as well
    image_urls = scrapy.Field()

    # I use MySQL for persistent storage; not expanded on here
    def get_insert_sql_and_data(self):
        # CREATE TABLE douban(
        #     id int not null auto_increment primary key,
        #     title text, `year` int, image text) ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        insert_sql = 'INSERT INTO douban(title,`year`,image)' \
                     ' VALUES(%s,%s,%s)'  # reserved words such as year need backticks
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)
middlewares.py
Middleware is where things get flexible. Plenty of people may never need it, but it is in fact essential when configuring proxies. Ordinary needs leave the SpiderMiddleware alone; the modifications mainly target the DownloaderMiddleware.
# signals: this term matters a great deal in Scrapy's custom extensions
from scrapy import signals
# a locally written class (code further below); you can back it with your own
# IP pool or hook it up to a paid proxy service (as I do)
from proxyhelper import Proxyhelper
# several threads operating on one object need a lock: instantiate it once,
# then acquire and release in pairs
from twisted.internet.defer import DeferredLock


class DoubanSpiderMiddleware(object):
    # the spider middleware is left unconfigured
    pass


class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # instantiate the proxy helper and the lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()

    @classmethod
    def from_crawler(cls, crawler):  # unchanged
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_request(self, request, spider):
        # fires when the request flow reaches the downloader middleware;
        # note that Scrapy reads the proxy from the lowercase meta key 'proxy'
        self.lock.acquire()
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None

    def process_response(self, request, response, spider):
        # inspect the response; if it is off, switch proxies and re-request
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            # without dont_filter the rescheduled request would be dropped as a duplicate
            request.dont_filter = True
            return request
        return response

    def process_exception(self, request, exception, spider):
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        request.dont_filter = True
        return request
    def spider_opened(self, spider):  # unchanged
        spider.logger.info('Spider opened: %s' % spider.name)
The proxyhelper code is attached below.
Honestly, most of the base code in this post comes from certain video courses on Bilibili, which proves Bilibili is China's biggest learning site : )
import requests


class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # only refresh when the caller still holds the current proxy, so one dead
        # proxy doesn't get refreshed several times in a row
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        url = ''  # fill in your provider's URL here, ideally returning one IP per call
        response = requests.get(url)
        return 'http://' + response.text.strip()
pipelines.py
# load the local MySQL persistence class; write your own as needed
from mysqlhelper import Mysqlhelper
# load ImagesPipeline so it can be subclassed and customized
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request


class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # carry the movie title along
            request_lst.append(req)
        return request_lst

    # overridden to control the file name
    def file_path(self, request, response=None, info=None):
        # image_guid is the default hash-based name, unused once we name by title
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'full/%s.jpg' % (request.meta['movie_name'])  # rename the image


# nothing special: part of the work already lives in items.py, which keeps the
# pipeline's and the item's responsibilities separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
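mysqlhelper is a local module the post doesn't show. Here is a minimal sketch of what the imported Mysqlhelper could look like, assuming pymysql and placeholder connection parameters:

import pymysql

class Mysqlhelper(object):
    def __init__(self):
        # hypothetical connection parameters -- adjust to your own MySQL server
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='******', db='spider',
                                    charset='utf8mb4')

    def execute_sql(self, sql, data):
        # parameterized execution matches the (sql, data) pair built in items.py
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()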
settings.py
An absolutely critical component; the explanations are annotated directly in the code.
# crawler name
BOT_NAME = 'Douban'
SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'

# client user agent
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# per-domain and per-IP concurrency; these override the setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# the telnet console lets you monitor a running crawler
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# usage: cmd -> telnet 127.0.0.1 6023 -> est()
# Override the default request headers:
# default headers, effective for every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
#     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }

# spider middlewares
# SPIDER_MIDDLEWARES = {
#     # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
#     'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 560 because the built-in middlewares are split into many ordered subgroups;
    # these numbers decide the order in which the request and response flows touch
    # each middleware -- see the official documentation for details
    'Douban.middlewares.DoubanDownloaderMiddleware': 560,
}
# time limit for a single request (note: Scrapy's setting is named DOWNLOAD_TIMEOUT)
DOWNLOAD_TIMEOUT = 10
# depth limit
# DEPTH_LIMIT = 1

# custom extensions
EXTENSIONS = {
    'Douban.extends.MyExtension': 500,
}
# item pipeline configuration
ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,  # the stock image downloader must be registered here
    'Douban.pipelines.DoubanImagesPipeline': 300,
    # register the MySQL pipeline from pipelines.py as well, or it will never run
    'Douban.pipelines.DoubanPipeline': 330,
}
# automatic, algorithm-based throttling
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# enable HTTP caching; rarely used
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# storage directory for the image-downloading ImagesPipeline; enable as needed
IMAGES_STORE = 'download'
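One dependency worth noting: Scrapy's images pipeline relies on the Pillow library for image processing, so install it (pip install Pillow) before enabling image downloads.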
extends.py
Custom extensions. Configuring this component calls for some familiarity with signals: a solid grasp of which signals fire at which moments of a Scrapy run, which in turn comes back to understanding the data flow.
In the code I use a class of my own that pushes a notification at particular moments via the MiaoTixing (喵提醒) service (will 喵提醒 pay me for the plug?). You could just as well use logging or other features to strengthen the extension, wiring them to whichever signal moments suit your purpose.
The file has to be created by hand, inside the project package next to settings.py (that is where the path Douban.extends.MyExtension registered in EXTENSIONS points).
from scrapy import signals
from message import Message


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        # read an optional numeric setting; 'MMMM' is a placeholder name
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        # bind handlers to the signals fired when a spider starts and stops
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        message = Message('spider finished')
        message.push()
        print('spider closed')
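message is likewise a local module not shown in the post. Here is a minimal sketch of what the Message class could look like; since the original depends on the MiaoTixing service, the endpoint below is a placeholder to swap for your own notification URL:

import requests

class Message(object):
    def __init__(self, text):
        self.text = text

    def push(self):
        # hypothetical endpoint -- replace with your MiaoTixing trigger URL or any webhook
        url = 'https://example.com/notify'
        requests.post(url, data={'text': self.text})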
running.py
Last, a word on running.py: it simply runs the scrapy crawl command-line instruction from inside Python, which is handy for starting or debugging a spider from an IDE.
from scrapy.cmdline import execute
execute('scrapy crawl douban'.split())
That covers the Scrapy component configuration that satisfies most basic needs. Questions and exchanges welcome.