Web Scraping in Practice, Day 3
Task
Scrape rental listings (first three pages) for the Beijing area of Xiaozhu (http://bj.xiaozhu.com/).
Result
Wrote the scraped listings into MongoDB, then queried for listings priced at 500/night or more.
Source code
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests

# Listing pages 1-3 for the Beijing area
pages = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 4)]

client = MongoClient('localhost', 27017)
xiao_zhu = client['xiao_zhu']
xiao_zhu_sheet = xiao_zhu['xiao_zhu_sheet']

def get_info(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    data = {
        'title': soup.select('div.pho_info > h4 > em')[0].get_text(),
        'address': soup.select('div.pho_info > p > span')[0].get_text().strip(' ').strip('\n'),
        # Store the price as an int so it can be compared numerically later
        'price': int(soup.select('#pricePart > div.day_l > span')[0].get_text()),
        # In Chrome the image link downloads instead of opening; in IE it opens directly
        'house_image': soup.select('#curBigImage')[0]['src'],
        'master_name': soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')[0]['title'],
        'master_sex': soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > span')[0]['class'][0].split('_')[1],
        'master_image': soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')[0]['src']
    }
    xiao_zhu_sheet.insert_one(data)

def get_url(start_url):
    wb_data = requests.get(start_url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    urls = soup.select('#page_list > ul > li > a')
    return urls

for page in pages:
    urls = get_url(page)
    for url in urls:
        try:
            get_info(url['href'])
        except Exception:
            # Skip listings whose page doesn't match the selectors above
            pass

'''
The 'price' field must be stored as a numeric type so it can be compared.
$lt/$lte/$gt/$gte/$ne (l == less, g == greater, e == equal, n == not)
print(type(xiao_zhu_sheet.find({'price': {'$gte': 500}})[0])) shows each item is actually a dict.
'''
for item in xiao_zhu_sheet.find({'price': {'$gte': 500}}):
    print(item)
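The CSS-selector extraction inside get_info can be exercised offline against a small HTML fragment, which makes the selectors easy to debug without hitting the site. The snippet below is a made-up stand-in for the listing-page structure, not the real markup:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a listing page (invented sample data, same structure as the selectors expect).
html = '''
<div class="pho_info">
  <h4><em>Cozy room near Gulou</em></h4>
  <p><span>  Dongcheng, Beijing
</span></p>
</div>
<div id="pricePart"><div class="day_l"><span>488</span></div></div>
'''

soup = BeautifulSoup(html, 'html.parser')  # 'html.parser' avoids the lxml dependency for this check
title = soup.select('div.pho_info > h4 > em')[0].get_text()
address = soup.select('div.pho_info > p > span')[0].get_text().strip(' ').strip('\n')
price = int(soup.select('#pricePart > div.day_l > span')[0].get_text())
print(title, address, price)
```

If a selector returns an empty list here, the IndexError shows up immediately instead of being swallowed by the try/except in the main loop.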
Notes
- With PyMongo you first create a client connection to MongoDB (it feels a bit like conn in MySQL?), then use that connection from Python to operate on MongoDB, creating the specific db and collection.
- For PyMongo syntax details, see: http://api.mongodb.com/python/current/
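The comparison operators noted in the source comment ($lt/$lte/$gt/$gte/$ne) map directly onto ordinary Python comparisons. As a quick offline sanity check (sample dicts standing in for the stored documents, no MongoDB server required), a filter like {'price': {'$gte': 500}} keeps exactly the documents a plain >= test would:

```python
import operator

# Sample documents shaped like the ones insert_one stores above (made-up data).
docs = [
    {'title': 'A', 'price': 300},
    {'title': 'B', 'price': 500},
    {'title': 'C', 'price': 800},
]

# MongoDB comparison operators and their plain-Python equivalents.
ops = {'$lt': operator.lt, '$lte': operator.le,
       '$gt': operator.gt, '$gte': operator.ge, '$ne': operator.ne}

def matches(doc, query):
    # Supports only {field: {op: value}} filters -- just enough to mirror the query above.
    field, cond = next(iter(query.items()))
    op, value = next(iter(cond.items()))
    return ops[op](doc[field], value)

result = [d['title'] for d in docs if matches(d, {'price': {'$gte': 500}})]
print(result)  # ['B', 'C']
```

Note that $gte keeps the boundary value 500 itself; $gt would drop it.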