1. Task
Parse the table data on the following web page into a pandas DataFrame.
https://en.wikipedia.org/wiki/Harvard_University
2. Method
- Fetch the data
import requests
response = requests.get('https://en.wikipedia.org/wiki/Harvard_University')
- Get the table
from lxml import etree
html = etree.HTML(response.text)
# xpath() is a method of the parsed tree, not of the etree module
table = html.xpath('//table[@class="wikitable"]')[0]
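- Parse the data in the table
# './/tr' also matches rows that are wrapped in a <tbody> element
tr_array = table.findall('.//tr')
texts = []
for tr in tr_array:
    line = []
    for c in tr.iterchildren():
        # c.text is the text directly inside the cell, before any child tag
        line.append(c.text)
    texts.append(line)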
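As a minimal offline sketch of the two steps above, the snippet below runs the same kind of XPath query and row loop on a small hand-written HTML string instead of the live page. The markup and its numbers are invented for illustration. `contains(@class, "wikitable")` and `itertext()` are substitutions for the exact class match and `c.text`, made on the assumption that the real table may carry extra classes or nested tags inside its cells.

```python
from lxml import etree

# Hypothetical stand-in for the Wikipedia page; the values are invented.
SAMPLE_HTML = """
<html><body>
<table class="wikitable sortable">
  <tbody>
    <tr><th></th><th>Undergraduate</th><th>U.S. Census</th></tr>
    <tr><th>Group A</th><td>10%</td><td><b>12%</b></td></tr>
    <tr><th>Group B</th><td>5%</td><td>N/A</td></tr>
  </tbody>
</table>
</body></html>
"""

html = etree.HTML(SAMPLE_HTML)
# contains() also matches when the table has several classes.
table = html.xpath('//table[contains(@class, "wikitable")]')[0]

texts = []
# './/tr' finds rows whether or not they are wrapped in <tbody>.
for tr in table.findall('.//tr'):
    # itertext() joins the text of nested tags such as <b> or <a>.
    line = [''.join(c.itertext()).strip() for c in tr.iterchildren()]
    texts.append(line)

print(texts)
```

Stripping each cell here also removes the trailing newlines that often follow cell text in real pages.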
- Extract the column names and index from the text
col_names = texts[0][1:]
index_names = [t[0] for t in texts[1:]]
- Convert the values
values = []
for line in texts[1:]:
    row = []
    for v in line[1:]:
        v = v.strip()
        if v == 'N/A':
            # missing value
            v = None
        elif v.endswith('%'):
            # drop the trailing '%' and convert to an integer
            v = int(v[:v.rfind('%')])
        row.append(v)
    values.append(row)
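The conversion rule above can also be written as a small function, which makes it easy to check on a few cells in isolation. This is only a sketch: `parse_cell` is a hypothetical helper name and the sample cells are invented.

```python
def parse_cell(text):
    """Convert one cell: 'N/A' -> None, '12%' -> 12, anything else unchanged."""
    v = text.strip()
    if v == 'N/A':
        return None
    if v.endswith('%'):
        # Everything before the trailing '%' is the number.
        return int(v[:-1])
    return v


# Invented sample line: the first item is the row label, the rest are cells.
sample_line = ['Group A', ' 10% ', 'N/A', '7%\n']
row = [parse_cell(v) for v in sample_line[1:]]
print(row)  # [10, None, 7]
```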
- Convert the data to a DataFrame
import pandas as pd
students = pd.DataFrame(values, columns=col_names, index=index_names)
(Figure: data conversion)
- Data issues
The third column, U.S. Census, contains NaN, and its data type is float.
>students.dtypes
Undergraduate int64
Graduate int64
U.S. Census float64
dtype: object
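The float type comes from the missing value: pandas stores `None` in a numeric column as NaN, and NaN is a float, so the whole column becomes float64. A minimal sketch with invented numbers, shaped like the parsed table:

```python
import pandas as pd

# Invented numbers; None stands for a cell that was 'N/A'.
values = [[10, 12, None], [5, 6, 13]]
sample = pd.DataFrame(values,
                      columns=['Undergraduate', 'Graduate', 'U.S. Census'],
                      index=['Group A', 'Group B'])

print(sample.dtypes)

census_dtype = str(sample['U.S. Census'].dtype)
missing = int(sample['U.S. Census'].isna().sum())
```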
Replace the NaN values with 0 and convert the data type to int
dfclearn = students.fillna(0).astype('int64')
(Figure: data type conversion)
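Filling with 0 makes a missing percentage indistinguishable from a real 0%. If that matters, one alternative (not what the code above does) is pandas' nullable integer dtype `Int64`, which keeps missing values as `<NA>`. A sketch with invented numbers:

```python
import pandas as pd

# Invented numbers; None stands for a cell that was 'N/A'.
sample = pd.DataFrame({'Graduate': [12, 6], 'U.S. Census': [None, 13]},
                      index=['Group A', 'Group B'])

# Approach from the text: missing values become 0.
filled = sample.fillna(0).astype('int64')

# Alternative: integer columns that can still hold missing values.
nullable = sample.astype('Int64')

print(filled.dtypes)
print(nullable)
```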