Pythonic Data Cleaning With NumPy and Pandas (Part 1)


Vocabulary Notes

a handful of columns — a small number of columns (fields)
roughly — approximately; broadly
enforce — to impose; to apply consistently

GitHub repository

https://github.com/realpython/python-data-cleaning

Datasets

  • BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
  • university_towns.txt – A text file containing names of college towns in every US state
  • olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

This article mainly covers the following:

  • Reading the dataset
  • Dropping columns as needed
  • Tidying up fields

>>> import pandas as pd
>>> import numpy as np

Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

Reading the Dataset

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:
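Assuming the folder layout just described (a Datasets directory inside the current working directory), the call would look roughly like this:

>>> # read the CSV with a relative path into a DataFrame
>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')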

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

We use the head() method to look at the first few entries of the dataset; several of the columns provide only ancillary information and are not useful for our analysis.

These columns can be dropped as follows:

>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.

We simply pass the columns to be removed, as a list, to the drop() method, and they are deleted.
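An equivalent form, available in pandas 0.21 and later, names the columns explicitly through the columns parameter so that axis does not have to be specified:

>>> # same effect as drop(to_drop, inplace=True, axis=1)
>>> df.drop(columns=to_drop, inplace=True)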

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:

Check the remaining columns again.
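For the loc lookup below to return a record keyed by Identifier, the index of the DataFrame also has to be changed (the "more sensible" index referred to in the next section). A minimal sketch, assuming the unique Identifier column is used as the index:

>>> # use the Identifier column as the index (assumption based on the loc output below)
>>> df = df.set_index('Identifier')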

Inspect a record with the loc indexer:

>>> df.loc[206]
Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

>>> df.get_dtype_counts()
object    6

get_dtype_counts() returns the counts of the unique dtypes in the DataFrame.
A column holding a mix of types is reported as object, and string columns are likewise reported as object.
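Note that get_dtype_counts() has been deprecated and removed in more recent pandas releases; the same information is available from the dtypes attribute:

>>> # equivalent in current pandas versions
>>> df.dtypes.value_counts()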

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

The date of publication should be coerced to a numeric type so that we can do calculations with it later.

df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

  • Remove the extra dates in square brackets, wherever present: 1879 [1878]
  • Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
  • Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
  • Convert the string nan to NumPy’s NaN value (all four steps are sketched in code below)
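The code that applies these rules is covered in the next installment; as a minimal sketch, extracting the leading four-digit year with a regular expression (the pattern r'^(\d{4})' is an assumption here) and coercing the result to a number handles all four cases, since values such as '[1897?]' and 'nan' simply fail to match and become NaN:

>>> # keep only the first four consecutive digits at the start of each value;
>>> # non-matching values (e.g. '[1897?]', 'nan') become NaN automatically
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> df['Date of Publication'] = pd.to_numeric(extr)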

Count the null values in each column:

df.isnull().sum()

Check the dtype counts of the data:

df.get_dtype_counts() 

When inspecting a DataFrame this way, every string-typed column shows up as the object dtype.
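If the Date of Publication conversion sketched above is applied, that column's dtype changes from object to float64 (NaN entries force a floating-point rather than an integer dtype), which is a quick way to confirm the cleaning worked:

>>> # after pd.to_numeric, missing years are NaN and the column is float64
>>> df['Date of Publication'].dtype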

The original article contains further data-cleaning operations; the next post will continue the translation and commentary.
