Pythonic Data Cleaning With NumPy and Pandas
生詞釋意
a handful of columns 少量字段
roughly 初略的 大體的
enforce 強(qiáng)迫實(shí)施 執(zhí)行
github庫(kù)
https://github.com/realpython/python-data-cleaning
數(shù)據(jù)集
- BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
- university_towns.txt – A text file containing names of college towns in every US state
- olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics
這篇文章中主要有以下幾個(gè)content
數(shù)據(jù)集讀取
按需刪除字段
清理字段
>>> import pandas as pd
>>> import numpy as np
Dropping Columns in a DataFrame
Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.
In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.
Pandas provides a handy way of removing unwanted columns or rows from a DataFrame
with the drop()
function. Let’s look at a simple example where we drop a number of columns from a DataFrame
.
讀取數(shù)據(jù)集
First, let’s create a DataFrame
out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv
, meaning that all of the datasets are in a folder named Datasets
in our current working directory:
When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.
我們使用head()方法查看數(shù)據(jù)集的前幾列基本信息。只有少量的字段對(duì)數(shù)據(jù)是有用的。
通過(guò)以下方式刪除
>>> to_drop = ['Edition Statement',
... 'Corporate Author',
... 'Corporate Contributors',
... 'Former owner',
... 'Engraver',
... 'Contributors',
... 'Issuance type',
... 'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)
Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.
我們把需要?jiǎng)h除的列,單獨(dú)以列表的形式,傳遞給drop方法,即可刪除
When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:
重新查看列數(shù)
使用定位函數(shù)查看
>>> df.loc[206]
Place of Publication London
Date of Publication 1879 [1878]
Publisher S. Tinsley & Co.
Title Walter Forbes. [A novel.] By A. A
Author A. A.
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object
Tidying up Fields in the Data 整理字段
So far, we have removed unnecessary columns and changed the index of our DataFrame
to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication
and Place of Publication
.
Upon inspection, all of the data types are currently the object
dtype, which is roughly analogous to str
in native Python.
It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:
>>> df.get_dtype_counts()
object 6
get_dtype_counts 返回此對(duì)象中唯一dtypes的計(jì)數(shù).
如果一列中含有多個(gè)類型,則該列的類型會(huì)是object,同樣字符串類型的列也會(huì)被當(dāng)成object類型.
One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:
出版日期應(yīng)該強(qiáng)制轉(zhuǎn)換為數(shù)字型,方便后續(xù)做計(jì)算
df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905 1888
1929 1839, 38-54
2836 [1897?]
2854 1865
2956 1860-63
2957 1873
3017 1866
3131 1899
4598 1814
4884 1820
Name: Date of Publication, dtype: object
A particular book can have only one date of publication. Therefore, we need to do the following:
一本確定的書,僅有一個(gè)確定的出版日期,因此我們需要做以下操作:
Remove the extra dates in square brackets, wherever present: 1879 [1878]
移除中括號(hào)內(nèi)額外的日期
Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
完全清除不確定的日期,用NumPy的NaN類型替代
Convert the string nan to NumPy’s NaN value
轉(zhuǎn)換string nan為 NumPy’s NaN
統(tǒng)計(jì)數(shù)據(jù)每列為空的數(shù)據(jù)個(gè)數(shù)的統(tǒng)計(jì)
df.isnull().sum()
查看數(shù)據(jù)的類型統(tǒng)計(jì)
df.get_dtype_counts()
dataframe 的時(shí)候 發(fā)現(xiàn)所有string 類型的 column 都是object類型
原文中還有一部分關(guān)于數(shù)據(jù)清理的操作,下篇文章繼續(xù)翻譯和解讀。