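I have recently been learning Python's Scrapy framework and have written quite a few small examples along the way. As the saying goes, a craftsman who wants to do good work must first sharpen his tools. Today's example crawls the technology news section of Toutiao (今日頭條) as practice with this tool.
This tutorial depends on the scrapy and pymongo modules, so install those dependencies first.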
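- 1. Analyze the Toutiao news API
- Toutiao fetches its JSON data asynchronously via AJAX, so the usual approach of waiting for the page to render and then extracting data falls short. Instead, capture and analyze the site's network traffic directly in the browser.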
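- Open a browser and visit Toutiao's technology news section at http://www.toutiao.com/ch/news_tech/
- Right-click and choose Inspect Element to analyze the page's network requests. Tick the checkbox that keeps the network request log (marked with a red arrow in the screenshot), then reload the page.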
[Screenshot: 360截圖20170925161331582.jpg]
- Go through the recorded requests one by one. The request to http://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A155493CA8EBB0F&cp=59C84BEB601F7E1 returns JSON data.
[Screenshot: 今日頭條]
- The returned data has the following format:
{ "has_more": false, "message": "success", "data": [ { "chinese_tag": "財經", "media_avatar_url": "http://p3.pstatp.com/large/1233000741099c9f4a59", "is_feed_ad": false, "tag_url": "news_finance", "title": "【特寫】數字貨幣的信徒們", "single_mode": true, "middle_mode": true, "abstract": "在九月初在中國發文整治ICO后,硅谷的區塊鏈項目創業者林嚇洪把籌集的資金全部還給了中國投資者們。在那次整治中,監管部門宣布,首次代幣發行(Initial Coin Offering,簡稱ICO)屬于非法行為,所有平臺必須返還籌集的資金。", "tag": "news_finance", "label": [ "數字貨幣", "風投", "比特幣", "投資", "經濟" ], "behot_time": 1506326903, "source_url": "/group/6469550301866803469/", "source": "界面新聞", "more_mode": false, "article_genre": "article", "image_url": "http://p1.pstatp.com/list/190x124/317200041ea1cf451f52", "has_gallery": false, "group_source": 1, "comments_count": 10, "group_id": "6469550301866803469", "media_url": "/c/user/52857496566/" }, { "image_url": "http://p3.pstatp.com/list/190x124/31770009f2c887fdb867", "single_mode": true, "abstract": "早,來看看今天的新聞。小米就校招風波道歉@DoNews【小米就校招風波道歉 對涉事員工通報批評】近日,一名自稱在河南鄭州大學日語專業學習的大學生表示,她與同學在一次校園招聘宣講會上無故被來自小米公司的主管人員諷刺。導致自己和本專業的同學憤然離開。", "middle_mode": false, "more_mode": true, "tag": "news_tech", "label": [ "小米科技", "亞馬遜公司", "Uber", "美國", "樂視" ], "tag_url": "news_tech", "title": "小米就校招風波道歉;ofo正尋求新一輪融資", "chinese_tag": "科技", "source": "虎嗅APP", "group_source": 1, "has_gallery": false, "media_url": "/c/user/3358265611/", "media_avatar_url": "http://p2.pstatp.com/large/18a50010126f235bf938", "image_list": [ { "url": "http://p3.pstatp.com/list/31770009f2c887fdb867" }, { "url": "http://p1.pstatp.com/list/317b00061c410d6d0352" }, { "url": "http://p3.pstatp.com/list/3172000337e0332b337f" } ], "source_url": "/group/6469472579270672654/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506326303, "comments_count": 114, "group_id": "6469472579270672654" }, { "image_url": "http://p3.pstatp.com/list/190x124/3c64000074857b07c81d", "single_mode": true, "abstract": "藍燕,經常關注香港電影的人應該不陌生,在2011年靠著香港三級影片《3D肉蒲團之極樂寶鑒》走紅,并逐漸出現人們的視線中。被稱為新一代的“艷星”??勺呒t后的她并沒有獲得很好的資源,所接拍的影片大多數是一些不知名的配角。", 
"middle_mode": false, "more_mode": true, "tag": "news_entertainment", "label": [ "藍燕 ", "肉蒲團", "投資", "娛樂" ], "tag_url": "news_entertainment", "title": "艷星藍燕美照曝光 靠著《3D肉蒲團》走紅", "chinese_tag": "娛樂", "source": "陪你樂不停", "group_source": 2, "has_gallery": false, "media_url": "/c/user/61497461135/", "media_avatar_url": "http://p3.pstatp.com/large/382f000f5dd459d0eb74", "image_list": [ { "url": "http://p3.pstatp.com/list/3c64000074857b07c81d" }, { "url": "http://p3.pstatp.com/list/3c6000022fcec3f4ca48" }, { "url": "http://p3.pstatp.com/list/3c60000230155491a84d" } ], "source_url": "/group/6469578595697164813/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506325703, "comments_count": 2, "group_id": "6469578595697164813" }, { "log_extra": "{\"ad_price\":\"Wci5d__iJRJZyLl3_-IlEuQYjwGdUeJEIl99Ew\",\"convert_id\":0,\"external_action\":0,\"req_id\":\"201709251608231720180471641841E3\",\"rit\":1}", "image_url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52", "read_count": 12196, "ban_comment": 1, "single_mode": true, "abstract": "", "image_list": [], "has_video": false, "article_type": 1, "tag": "ad", "display_info": "股市迎來重磅利好消息,這些股或將上漲翻倍,微信領取", "has_m3u8_video": 0, "label": "廣告", "user_verified": 0, "aggr_type": 1, "expire_seconds": 314754930, "cell_type": 0, "article_sub_type": 0, "group_flags": 4096, "bury_count": 0, "title": "股市迎來重磅利好消息,這些股或將上漲翻倍,微信領取", "ignore_web_transform": 1, "source_icon_style": 3, "tip": 0, "hot": 0, "share_url": "http://m.toutiao.com/group/6465452273144168717/?iid=0&app=news_article", "has_mp4_video": 0, "source": "聯訊證券", "comment_count": 0, "article_url": "http://cq3.ilyae.cn/toutiao2/index.html", "filter_words": [ { "id": "1:74", "name": "股票", "is_selected": false }, { "id": "1:6", "name": "金融保險", "is_selected": false }, { "id": "2:0", "name": "來源:聯訊證券", "is_selected": false }, { "id": "4:2", "name": "看過了", "is_selected": false } ], "has_gallery": false, "publish_time": 1505355414, "ad_id": 69048936405, "action_list": [ { 
"action": 1, "extra": {}, "desc": "" }, { "action": 3, "extra": {}, "desc": "" }, { "action": 7, "extra": {}, "desc": "" }, { "action": 9, "extra": {}, "desc": "" } ], "has_image": false, "cell_layout_style": 1, "tag_id": 6465452273144168717, "source_url": "http://cq3.ilyae.cn/toutiao2/index.html", "video_style": 0, "verified_content": "", "is_feed_ad": true, "large_image_list": [], "item_id": 6465452273144168717, "natant_level": 2, "tag_url": "search/?keyword=None", "article_genre": "ad", "level": 0, "cell_flag": 10, "source_open_url": "sslocal://search?from=channel_source&keyword=%E8%81%94%E8%AE%AF%E8%AF%81%E5%88%B8", "display_url": "http://cq3.ilyae.cn/toutiao2/index.html", "digg_count": 0, "behot_time": 1506325103, "article_alt_url": "http://m.toutiao.com/group/article/6465452273144168717/", "cursor": 1506325103999, "url": "http://cq3.ilyae.cn/toutiao2/index.html", "preload_web": 0, "ad_label": "廣告", "user_repin": 0, "label_style": 3, "item_version": 0, "group_id": "6465452273144168717", "middle_image": { "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52", "width": 456, "url_list": [ { "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52" }, { "url": "http://pb9.pstatp.com/large/26c00009898dbc9c5a52" }, { "url": "http://pb1.pstatp.com/large/26c00009898dbc9c5a52" } ], "uri": "large/26c00009898dbc9c5a52", "height": 256 } }, { "image_url": "http://p3.pstatp.com/list/190x124/3b050002710aff2b3422", "single_mode": true, "abstract": "如今2017年微信的月活躍用戶達9億,微信成了中國最大用戶群體的手機APP,它集通訊、娛樂、支付等于一體。很多朋友習慣每天打開微信收發信息、查看朋友圈動態。", "middle_mode": false, "more_mode": true, "tag": "news_tech", "label": [ "移動互聯網", "微信", "澤西島", "美女", "歐洲" ], "tag_url": "news_tech", "title": "為什么微信中那么多美女來自安道爾或澤西島?這是一種暗語嗎", "chinese_tag": "科技", "source": "獅子夜光杯", "group_source": 2, "has_gallery": false, "media_url": "/c/user/53397416061/", "media_avatar_url": "http://p3.pstatp.com/large/12330013573aaa4c18b1", "image_list": [ { "url": "http://p3.pstatp.com/list/3b050002710aff2b3422" }, { "url": 
"http://p3.pstatp.com/list/3b05000271096e15298e" }, { "url": "http://p9.pstatp.com/list/3b080000bdf469bf7330" } ], "source_url": "/group/6467319367565574670/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506324503, "comments_count": 46, "group_id": "6467319367565574670" }, { "image_url": "http://p3.pstatp.com/list/190x124/3b0f0003c132eb485453", "single_mode": true, "abstract": "最近幾周,各大互聯網科技公司都開始秋季招聘了這些是正經的公司的招聘筆試題:關于c++的inline關鍵字,以下說法正確的是()對N個數進行排序,在各自最優條件下以下算法復雜度最低的是()為百度設計一款新產品,可以結合百度現有的優勢和資源,專注解決大學生用戶的某個需求痛點,請給出主", "middle_mode": false, "more_mode": true, "tag": "news_design", "label": [ "電子商務", "京東", "面試", "劉強東", "計算復雜性理論" ], "tag_url": "search/?keyword=%E8%AE%BE%E8%AE%A1", "title": "京東校招筆試題“如何用0.01元買到一瓶可樂”?竟被蘇寧秀了一臉", "chinese_tag": "設計", "source": "小禾科技", "group_source": 2, "has_gallery": false, "media_url": "/c/user/59954335187/", "media_avatar_url": "http://p9.pstatp.com/large/39b10003f6cddd5128fa", "image_list": [ { "url": "http://p3.pstatp.com/list/3b0f0003c132eb485453" }, { "url": "http://p3.pstatp.com/list/3b110000ab4c79a56483" }, { "url": "http://p9.pstatp.com/list/3b1600007cde1cf9bdd0" } ], "source_url": "/group/6468140283245625870/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506323903, "comments_count": 87, "group_id": "6468140283245625870" }, { "chinese_tag": "科技", "media_avatar_url": "http://p9.pstatp.com/large/2c6600049c7144303824", "is_feed_ad": false, "tag_url": "news_tech", "title": "為什么家里的WIFI時快時慢?竟然是因為……", "single_mode": true, "middle_mode": false, "abstract": "現在還是個信息的時代,不僅手機、電腦非常普遍,而且現在的人們都喜歡用無線網絡之WiFi,因為這樣更加便捷。在家使用手機的時候,不用打開手機的數據流量,只要使用WiFi就可以了,無限的流量使用,太方便了。但是很多用戶都會有這樣的體驗,WiFi速度時快時慢的,很是煩惱。", "group_source": 2, "image_list": [ { "url": "http://p3.pstatp.com/list/3b1600009ba8a7500c7e" }, { "url": "http://p1.pstatp.com/list/3b1600009bb32db8a78a" }, { "url": "http://p3.pstatp.com/list/3b120000c5dac40ae0fe" } ], "label": [ "Wi-Fi", "科技" ], "behot_time": 1506323303, "source_url": 
"/group/6468146583144759822/", "source": "水電小知識", "more_mode": true, "article_genre": "article", "image_url": "http://p3.pstatp.com/list/190x124/3b1600009ba8a7500c7e", "tag": "news_tech", "has_gallery": false, "group_id": "6468146583144759822", "media_url": "/c/user/61795844218/" } ], "next": { "max_behot_time": 1506323303 } }
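- 2. Analyze the request parameters and how the requests repeat:
- The technology news endpoint is called with a GET request that passes the following query parameters: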
category:news_tech utm_source:toutiao widen:1 max_behot_time:0 max_behot_time_tmp:0 tadrequire:true as:A155493CA8EBB0F cp:59C84BEB601F7E1
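- Scroll the page to trigger another asynchronous request and compare its parameters: only a few query parameters change. The next->max_behot_time field of the previous response is exactly the value used for max_behot_time and max_behot_time_tmp. The as and cp parameters have little effect on the GET request, so the values captured from any one request can simply be reused. As for max_behot_time, the author believes it is a timestamp; since the response already hands us the value, there is no need to guess further. Packet analysis is sometimes exactly this kind of guessing at what API parameters mean, and readers are welcome to verify it: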
max_behot_time:1506326351 max_behot_time_tmp:1506326351 as:A115996C383BD3C cp:59C82BAD839CBE1
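The pagination described above can be sketched in a few lines of standalone Python. This is only an illustration under stated assumptions: the names FEED_ENDPOINT, build_feed_url and next_cursor are hypothetical helpers introduced here, the as/cp defaults are just the values from the captured request, and the sample body is a trimmed stand-in for a real response.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the captured request; the helper names below are hypothetical.
FEED_ENDPOINT = "https://www.toutiao.com/api/pc/feed/"


def build_feed_url(max_behot_time, as_value="A155493CA8EBB0F", cp_value="59C84BEB601F7E1"):
    """Build the feed URL for one page; both time parameters share one value."""
    params = [
        ("category", "news_tech"),
        ("utm_source", "toutiao"),
        ("widen", 1),
        ("max_behot_time", max_behot_time),
        ("max_behot_time_tmp", max_behot_time),
        ("tadrequire", "true"),
        # as/cp are simply reused from one captured request.
        ("as", as_value),
        ("cp", cp_value),
    ]
    return FEED_ENDPOINT + "?" + urlencode(params)


def next_cursor(raw_json):
    """Return next.max_behot_time from a response body, or None if unavailable."""
    payload = json.loads(raw_json)
    if payload.get("message") != "success":
        return None
    return payload.get("next", {}).get("max_behot_time")


# Trimmed stand-in with the same top-level shape as a real response.
sample_body = '{"has_more": false, "message": "success", "data": [], "next": {"max_behot_time": 1506323303}}'

first_page = build_feed_url(0)
second_page = build_feed_url(next_cursor(sample_body))
print(first_page)
print(second_page)
```

The first URL uses 0 for both time parameters, matching the initial request; every later URL takes the cursor returned by the previous response.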
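- 3. Construct the request URL:
- The Scrapy project's directory structure is as follows:
[Figure: project structure diagram]
- The settings.py source is as follows: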
# -*- coding: utf-8 -*-
# Scrapy settings for todayNews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'todayNews'
SPIDER_MODULES = ['todayNews.spiders']
NEWSPIDER_MODULE = 'todayNews.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'todayNews (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept':'text/javascript, text/html, application/xml, text/xml, */*',
'Accept-Encoding':'gzip, deflate, sdch, br',
'Accept-Language':'zh-CN,zh;q=0.8',
'Cache-Control':'no-cache',
'Connection':'keep-alive',
'Content-Type':'application/x-www-form-urlencoded',
'Cookie':'uuid="w:3db0708ea2c549fab1a5371c56f16176"; UM_distinctid=15c7147fecd8d-0a4277451-4349052c-100200-15c7147fecf6f; csrftoken=af9a5a0d4cd30794e6c04511ca9f31eb; _ga=GA1.2.312467779.1496549163; __guid=32687416.738502311042654200.1505560389379.9048; tt_track_id=c7baa73a99ec9787ead7a2f6b01ff56b; _ba=BA0.2-20170923-51d9e-ErxmsyZIIoxNOzZgf6Us; tt_webid=6427627096743282178; WEATHER_CITY=%E5%8C%97%E4%BA%AC; CNZZDATA1259612802=610804389-1496543540-null%7C1506261975; __tasessionId=0vta7k1uc1506263833592; tt_webid=6427627096743282178',
'Host':'www.toutiao.com',
'Pragma':'no-cache',
'Referer':'https://www.toutiao.com/ch/news_tech/',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'X-Requested-With':'XMLHttpRequest'
}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'todayNews.middlewares.TodaynewsSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'todayNews.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'todayNews.pipelines.MongoPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
DOWNLOAD_DELAY = 1
MONGO_URI="localhost"
MONGO_DATABASE="toutiao"
MONGO_USER="username"
MONGO_PASS="password"
- The pipelines.py source is as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class MongoPipeline(object):
    # Name of the MongoDB collection that stores the crawled news
    collection_name = "science"

    def __init__(self, mongo_uri, mongo_db, mongo_user, mongo_pass):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongo_user = mongo_user
        self.mongo_pass = mongo_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
            mongo_user=crawler.settings.get('MONGO_USER'),
            mongo_pass=crawler.settings.get('MONGO_PASS'),
        )

    def open_spider(self, spider):
        # Database.authenticate() was removed in PyMongo 4, so the credentials
        # go to the client; authSource keeps authentication on the same database
        self.client = pymongo.MongoClient(
            self.mongo_uri,
            username=self.mongo_user,
            password=self.mongo_pass,
            authSource=self.mongo_db,
        )
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].update({'url_token': item['url_token']}, {'$set': dict(item)}, True)
        # return item
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item
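The commented-out update line in process_item hints at an upsert, which would keep a repeated crawl from storing the same article twice. Below is a minimal sketch of that idea, assuming group_id uniquely identifies an article (the sample data suggests this, but it is not confirmed). The helper build_upsert is hypothetical and only prepares the arguments for pymongo's Collection.update_one.

```python
def build_upsert(item):
    """Return (filter, update) arguments for Collection.update_one(..., upsert=True)."""
    record = dict(item)
    # Assumption: group_id uniquely identifies an article in the feed.
    return {"group_id": record["group_id"]}, {"$set": record}


query, update = build_upsert({"group_id": "6469550301866803469", "title": "example"})
print(query, update)

# Inside process_item this could replace the plain insert:
#   self.db[self.collection_name].update_one(query, update, upsert=True)
```

With upsert=True, MongoDB inserts the record when no document matches the filter and otherwise updates the existing one.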
- The toutiao.py source is as follows:
# -*- coding: utf-8 -*-
import json
import logging

from scrapy import Spider, Request
from todayNews.items import TodaynewsItem


class ToutiaoSpider(Spider):
    name = "toutiao"
    allowed_domains = ["www.toutiao.com"]
    start_urls = ['https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1D5394CB72C38F&cp=59C71C03883F0E1']
    # Template for follow-up requests; both time parameters take next.max_behot_time
    url = 'https://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time={behot_time}&max_behot_time_tmp={behot_time_tmp}&tadrequire=true&as=A165E92C97CC487&cp=59C74CC4E8F7BE1'

    def parse(self, response):
        jsonData = json.loads(response.body.decode("utf-8"))
        # Check the status first so a failed response does not raise KeyError
        if jsonData.get("message") == 'success':
            MainData = jsonData["data"]
            nextTime = jsonData["next"]["max_behot_time"]
            # Yield each raw record as a plain dict; the pipeline stores it as-is
            for rowData in MainData:
                yield rowData
            # Request the next page with the cursor returned by this response
            yield Request(url=self.url.format(behot_time=nextTime, behot_time_tmp=nextTime), callback=self.parse)
        else:
            logging.info("The Data is null")
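- items.py normally defines the structured fields to extract. Because the JSON that Toutiao returns is not uniform (see the sample data above), no item fields are defined here. The raw records are passed straight to the pipeline, which saves them to MongoDB.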
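If a uniform shape is wanted later, the irregularities visible in the sample data can be smoothed over before storage. In the sample, label is a list for articles but a plain string for the ad entry, and the comment counter appears as comments_count in articles and comment_count in the ad. The sketch below is one possible approach, not part of the original project: the function normalize and its output field names are hypothetical.

```python
def normalize(row):
    """Map one raw feed record onto a small, uniform dict."""
    label = row.get("label", [])
    # The ad entry carries a single string instead of a list of labels.
    if isinstance(label, str):
        label = [label]
    return {
        "group_id": row.get("group_id"),
        "title": row.get("title", ""),
        "source": row.get("source", ""),
        "tag": row.get("tag", ""),
        # Some labels in the sample have trailing spaces.
        "labels": [text.strip() for text in label],
        # Articles use comments_count, the ad entry uses comment_count.
        "comments": row.get("comments_count", row.get("comment_count", 0)),
        "is_ad": bool(row.get("is_feed_ad", False)),
    }


article = normalize({"group_id": "1", "title": "t", "label": ["a ", "b"], "comments_count": 10})
ad = normalize({"group_id": "2", "label": "廣告", "comment_count": 0, "is_feed_ad": True})
print(article)
print(ad)
```

Calling such a function on each record before yielding it from parse would also make it easy to skip entries where is_ad is true.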
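- 4. Start the crawler and inspect the crawled data. From the project root, the spider can be started with: scrapy crawl toutiao
[Screenshot: saved data]
Done.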