Grouping a dataset and applying a function to each group (whether an aggregation or a transformation) is a critical part of data analysis. After a dataset has been loaded and prepared, the next task is often to compute group statistics or pivot tables.
pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize datasets in a natural way.
Topics in this chapter:
Split a pandas object into pieces using one or more keys (which can be functions, arrays, or DataFrame column names).
Compute group summary statistics, such as count, mean, standard deviation, or a user-defined function.
Apply a varying set of functions to each column of a DataFrame.
Apply within-group transformations or other manipulations, such as normalization, linear regression, ranking, or subset selection.
Compute pivot tables and cross-tabulations.
Perform quantile analysis and other group analyses.
9.1 GroupBy Mechanics
Group operations are often described by the term split-apply-combine.
In the first stage of the process, data contained in a pandas object (whether a Series or a DataFrame) is split into groups based on one or more keys that we provide. The splitting is performed on a particular axis of the object.
[Figure: the split-apply-combine mechanics of a group aggregation]
The grouping keys can take many forms, and the keys do not all have to be of the same type:
A list or array of values that is the same length as the axis being grouped;
A value indicating a column name in a DataFrame;
A dict or Series giving a correspondence between the values on the axis being grouped and the group names;
A function to be invoked on the axis index or the individual labels in the index.
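As a minimal sketch (on a tiny made-up frame), the four kinds of keys listed above might be used like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 3]})

# 1. an array the same length as the grouped axis
by_array = df['v'].groupby(np.array(['x', 'y', 'x'])).sum()

# 2. a DataFrame column name
by_col = df.groupby('k')['v'].sum()

# 3. a dict mapping index labels to group names
by_dict = df['v'].groupby({0: 'g1', 1: 'g1', 2: 'g2'}).sum()

# 4. a function called on each index label
by_func = df['v'].groupby(lambda i: i % 2).sum()
```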
In [3]: import numpy as np
...: import pandas as pd
...: import matplotlib.pyplot as plt
...: from pandas import Series,DataFrame
...:
In [4]: df =DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','one'],
...: 'data1':np.random.randn(5),'data2':np.random.randn(5)})
In [5]: df
Out[5]:
data1 data2 key1 key2
0 -0.713865 -0.508708 a one
1 -0.001112 -0.431989 a two
2 -1.845435 1.631306 b one
3 1.158896 -1.145442 b two
4 -0.555897 -2.520632 a one
#select data1 and call groupby with the data in key1
In [6]: grouped=df['data1'].groupby(df['key1'])
In [7]: grouped
Out[7]: <pandas.core.groupby.SeriesGroupBy object at 0x000000000A1ADA90>
#call the GroupBy object's mean method to compute group means
In [8]: grouped.mean()
Out[8]:
key1
a 0.03546
b -1.77451
Name: data1, dtype: float64
In [10]: means=df['data1'].groupby([df['key1'],df['key2']]).mean()
In [11]: means
Out[11]:
key1 key2
a one 0.771713
two -1.201122
b one -0.495424
two 0.955653
Name: data1, dtype: float64
In [12]: means.unstack()
Out[12]:
key2 one two
key1
a 0.771713 -1.201122
b -0.495424 0.955653
In [14]: states=np.array(['Ohio','California','California','Ohio','Ohio'])
In [15]: years=np.array([2005,2005,2006,2005,2006])
In [18]: df['data1'].groupby([states,years]).mean()
Out[18]:
California 2005 -1.201122
2006 -0.495424
Ohio 2005 0.606807
2006 1.285464
Name: data1, dtype: float64
#use column names as the group keys
In [21]: df.groupby('key1').mean()
Out[21]:
data1 data2
key1
a 0.114101 -0.572603
b 0.230114 -0.583885
In [22]: df.groupby(['key1','key2']).mean()
Out[22]:
data1 data2
key1 key2
a one 0.771713 -0.444467
two -1.201122 -0.828875
b one -0.495424 1.384597
two 0.955653 -2.552366
#size returns a Series containing group sizes
In [23]: df.groupby(['key1','key2']).size()
Out[23]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
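Note that size counts every row in a group, whereas count counts only the non-NA values in each column. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, np.nan, 2.0]})

sizes = df.groupby('key').size()           # rows per group, NaN included
counts = df.groupby('key')['val'].count()  # non-NA values per group
```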
9.1.2 Iterating over groups
#iterating over a GroupBy object yields a sequence of 2-tuples
In [9]: for name,group in df.groupby('key1'):
   ...:     print(name)
   ...:     print(group)
   ...:
a
data1 data2 key1 key2
0 -0.713865 -0.508708 a one
1 -0.001112 -0.431989 a two
4 -0.555897 -2.520632 a one
b
data1 data2 key1 key2
2 -1.845435 1.631306 b one
3 1.158896 -1.145442 b two
#with multiple keys, the first element of each tuple is a tuple of key values
In [10]: for (k1,k2), group in df.groupby(['key1','key2']):
   ...:     print(k1,k2)
   ...:     print(group)
   ...:
a one
data1 data2 key1 key2
0 -0.713865 -0.508708 a one
4 -0.555897 -2.520632 a one
a two
data1 data2 key1 key2
1 -0.001112 -0.431989 a two
b one
data1 data2 key1 key2
2 -1.845435 1.631306 b one
b two
data1 data2 key1 key2
3 1.158896 -1.145442 b two
#compute a dict of these data pieces
In [12]: pieces=dict(list(df.groupby('key1')))
In [13]: pieces['b']
Out[13]:
data1 data2 key1 key2
2 -1.845435 1.631306 b one
3 1.158896 -1.145442 b two
By default, groupby groups on axis=0, but by passing axis it can group on any of the other axes.
In [14]: df.dtypes
Out[14]:
data1 float64
data2 float64
key1 object
key2 object
dtype: object
In [17]: grouped=df.groupby(df.dtypes,axis=1)
In [18]: dict(list(grouped))
Out[18]:
{dtype('float64'): data1 data2
0 -0.713865 -0.508708
1 -0.001112 -0.431989
2 -1.845435 1.631306
3 1.158896 -1.145442
4 -0.555897 -2.520632, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
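Grouping on axis=1 is deprecated in recent pandas releases; when the goal is simply to split the columns by dtype, select_dtypes gives the same column split directly (a sketch using the same column layout):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one']})

floats = df.select_dtypes(include='float64')  # data1, data2
objs = df.select_dtypes(include='object')     # key1, key2
```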
9.1.3 Selecting a column or subset of columns
Indexing a GroupBy object created from a DataFrame with a column name (a single string) or an array of column names (a list of strings) has the effect of selecting those columns for aggregation.
df.groupby('key1')['data1']
df.groupby('key1')['data2']
is syntactic sugar for:
df['data1'].groupby(df['key1'])
df['data2'].groupby(df['key1'])
Especially for large datasets, it may be desirable to aggregate only a few columns.
For the dataset above, to compute the means of just the data2 column and get the result as a DataFrame, we could write:
In [2]: df =DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
In [4]: df
Out[4]:
data1 data2 key1 key2
0 -0.119115 -1.660365 a one
1 -1.677104 1.664901 a two
2 -0.124288 0.991688 b one
3 -0.357859 -0.645814 b two
4 -0.627007 -0.340816 a one
In [5]: df.groupby(['key1','key2'])[['data2']].mean()
Out[5]:
data2
key1 key2
a one -1.000590
two 1.664901
b one 0.991688
two -0.645814
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar:
In [6]: s_grouped=df.groupby(['key1','key2'])['data2']
In [7]: s_grouped
Out[7]: <pandas.core.groupby.SeriesGroupBy object at 0x000000000B6345F8>
In [8]: s_grouped.mean()
Out[8]:
key1 key2
a one -1.000590
two 1.664901
b one 0.991688
two -0.645814
Name: data2, dtype: float64
9.1.4 Grouping with dicts and Series
Grouping information may also exist in forms other than an array.
In [9]: people =DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
In [10]: people
Out[10]:
a b c d e
Joe 0.153422 -1.740001 -1.814139 0.358241 -0.130256
Steve -0.870311 1.199198 0.275245 0.160661 0.144324
Wes -0.298472 0.472300 1.070169 0.899584 2.011791
Jim -0.638032 -0.011376 0.685198 0.625192 1.335396
Travis -2.482942 1.661548 1.284279 -1.061266 -0.632708
#add a few NA values
In [11]: people.loc['Wes',['b','c']]=np.nan
In [12]: people
Out[12]:
a b c d e
Joe 0.153422 -1.740001 -1.814139 0.358241 -0.130256
Steve -0.870311 1.199198 0.275245 0.160661 0.144324
Wes -0.298472 NaN NaN 0.899584 2.011791
Jim -0.638032 -0.011376 0.685198 0.625192 1.335396
Travis -2.482942 1.661548 1.284279 -1.061266 -0.632708
Suppose we have a known group correspondence for the columns and want to sum the columns by group.
In [13]: mappings={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
In [15]: by_column=people.groupby(mappings,axis=1)
In [16]: by_column
Out[16]: <pandas.core.groupby.DataFrameGroupBy object at 0x000000000B634D68>
In [17]: by_column.sum()
Out[17]:
blue red
Joe -1.455897 -1.716835
Steve 0.435906 0.473211
Wes 0.899584 1.713320
Jim 1.310389 0.685988
Travis 0.223013 -1.454102
A Series can serve the same purpose; it can be viewed as a fixed-size mapping.
In [19]: map_series=Series(mappings)
In [20]: map_series
Out[20]:
a red
b red
c blue
d blue
e red
f orange
dtype: object
In [22]: people.groupby(map_series,axis=1).count()
Out[22]:
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
9.1.5 Grouping with functions
Compared with a dict or Series, using Python functions is a more creative and abstract way of defining a group mapping.
Any function passed as a group key will be called once per index value, and the return values are used as the group names.
In the following example, suppose we want to group by the length of the names. While we could compute an array of string lengths, it's simpler to just pass the len function:
In [28]: people.groupby(len).sum()
Out[28]:
a b c d e
3 -0.783082 -1.751376 -1.128941 1.883017 3.216931
5 -0.870311 1.199198 0.275245 0.160661 0.144324
6 -2.482942 1.661548 1.284279 -1.061266 -0.632708
In [31]: key_list=['one','one','one','two','two']
In [32]: people.groupby([len,key_list]).min()
Out[32]:
a b c d e
3 one -0.298472 -1.740001 -1.814139 0.358241 -0.130256
two -0.638032 -0.011376 0.685198 0.625192 1.335396
5 one -0.870311 1.199198 0.275245 0.160661 0.144324
6 two -2.482942 1.661548 1.284279 -1.061266 -0.632708
9.1.6 Grouping by index levels
A convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index.
To do this, pass the level number or name using the level keyword:
In [33]: columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names=['city','tenor'])
In [34]: hier_df=DataFrame(np.random.randn(4,5),columns=columns)
In [35]: columns
Out[35]:
MultiIndex(levels=[[u'JP', u'US'], [1, 3, 5]],
labels=[[1, 1, 1, 0, 0], [0, 1, 2, 0, 1]],
names=[u'city', u'tenor'])
In [36]: hier_df=DataFrame(np.random.randn(4,5),columns=columns)
In [37]: hier_df
Out[37]:
city US JP
tenor 1 3 5 1 3
0 1.416009 -0.016826 1.498950 1.010254 1.757742
1 -0.528243 -1.113364 0.120569 0.209329 0.260765
2 0.540845 2.198479 -1.307002 -0.545171 -0.378676
3 -0.625421 -0.960389 -1.435062 1.851948 0.210522
In [38]: hier_df.groupby(level='city',axis=1).count()
Out[38]:
city JP US
0 2 3
1 2 3
2 2 3
3 2 3
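The same level keyword also works on a row index. A minimal sketch with a made-up two-level index:

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([['US', 'US', 'JP'], [1, 3, 1]],
                                names=['cty', 'tenor'])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# aggregate over the 'cty' level of the row index
by_cty = s.groupby(level='cty').sum()
```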
9.2 Data Aggregation
Aggregation refers to any data transformation that produces scalar values from arrays.
In [1]: import numpy as np
...: import pandas as pd
...: import matplotlib.pyplot as plt
...: from pandas import Series,DataFrame
...:
In [2]: df =DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','one'],
   ...: 'data1':np.random.randn(5),'data2':np.random.randn(5)})
In [3]: df
Out[3]:
data1 data2 key1 key2
0 -1.828647 0.329238 a one
1 -0.639952 1.532362 a two
2 0.617105 0.906281 b one
3 1.443470 1.419738 b two
4 -2.031339 -0.649743 a one
In [4]: grouped = df.groupby('key1')
#quantile computes sample quantiles of a Series or of a DataFrame's columns
In [5]: grouped['data1'].quantile(0.9)
Out[5]:
key1
a -0.877691
b 1.360834
Name: data1, dtype: float64
#define a custom function and pass it to the agg method
In [6]: def peak_to_peak(arr):
   ...:     return arr.max() - arr.min()
In [8]: grouped.agg(peak_to_peak)
Out[8]:
data1 data2
key1
a 1.391387 2.182105
b 0.826366 0.513457
In [9]: grouped.describe()
Out[9]:
data1 data2
key1
a count 3.000000 3.000000
mean -1.499979 0.403952
std 0.751669 1.092969
min -2.031339 -0.649743
25% -1.929993 -0.160253
50% -1.828647 0.329238
75% -1.234300 0.930800
max -0.639952 1.532362
b count 2.000000 2.000000
mean 1.030287 1.163010
std 0.584329 0.363069
min 0.617105 0.906281
25% 0.823696 1.034645
50% 1.030287 1.163010
75% 1.236879 1.291374
max 1.443470 1.419738
In [10]: tips = pd.read_csv(r"E:\python\pydata-book-master\ch08\tips.csv")
In [11]: tips[:4]
Out[11]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
#add a tip_pct column: the tip as a percentage of the total bill
In [12]: tips['tip_pct'] = tips['tip']/tips['total_bill']
In [13]: tips[:4]
Out[13]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
9.2.1 Column-wise and multiple function application
Aggregating a Series or the columns of a DataFrame means using aggregate (with a custom function) or calling methods like mean or std.
Continuing with the tipping dataset, let's apply different aggregation functions to different columns.
In [14]: grouped = tips.groupby(['sex','smoker'])
In [19]: grouped_pct = grouped['tip_pct']
In [20]: grouped_pct.agg('mean')
Out[20]:
sex smoker
Female No 0.156921
Yes 0.182150
Male No 0.160669
Yes 0.152771
Name: tip_pct, dtype: float64
#if you pass a list of function names, the resulting columns are named accordingly
In [24]: grouped_pct.agg(['mean','std',peak_to_peak])
Out[24]:
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
#if you pass a list of (name, function) tuples, the first element of each tuple is used as the column name
In [27]: grouped_pct.agg([('foo','mean'),('bar',np.std)])
Out[27]:
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
#compute three statistics for the tip_pct and total_bill columns
In [28]: function = ['count','mean','max']
In [29]: result = grouped['tip_pct','total_bill'].agg(function)
In [30]: result
Out[30]:
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
In [31]: result['tip_pct']
Out[31]:
count mean max
sex smoker
Female No 54 0.156921 0.252672
Yes 33 0.182150 0.416667
Male No 97 0.160669 0.291990
Yes 60 0.152771 0.710345
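agg also accepts a dict mapping column names to functions, so different columns can receive different aggregations (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'x': [1.0, 3.0, 5.0],
                   'y': [2.0, 4.0, 6.0]})

# map each column to its own aggregation function
res = df.groupby('key').agg({'x': 'mean', 'y': 'max'})
```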
9.2.2 Returning aggregated data in "unindexed" form
#pass as_index=False to disable the index formed from the unique group key combinations
In [32]: tips.groupby(['sex','smoker'],as_index=False).mean()
Out[32]:
sex smoker total_bill tip size tip_pct
0 Female No 18.105185 2.773519 2.592593 0.156921
1 Female Yes 17.977879 2.931515 2.242424 0.182150
2 Male No 19.791237 3.113402 2.711340 0.160669
3 Male Yes 22.284500 3.051167 2.500000 0.152771
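The same "unindexed" result can always be obtained by calling reset_index on the aggregated result; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 2.0]})

flat = df.groupby('key', as_index=False).mean()
via_reset = df.groupby('key').mean().reset_index()
# both produce a flat table with 'key' as an ordinary column
```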
9.3 Group-wise Operations and Transformations
Aggregation is only a special case of group operations: it accepts functions that reduce a one-dimensional array to a scalar value.
Below we add a column holding the group means for each key, by aggregating first and then merging:
In [33]: df
Out[33]:
data1 data2 key1 key2
0 -1.828647 0.329238 a one
1 -0.639952 1.532362 a two
2 0.617105 0.906281 b one
3 1.443470 1.419738 b two
4 -2.031339 -0.649743 a one
In [34]: k1_means = df.groupby('key1').mean().add_prefix('mean_')
In [35]: k1_means
Out[35]:
mean_data1 mean_data2
key1
a -1.499979 0.403952
b 1.030287 1.163010
In [36]: pd.merge(df,k1_means,left_on = 'key1',right_index = True)
Out[36]:
data1 data2 key1 key2 mean_data1 mean_data2
0 -1.828647 0.329238 a one -1.499979 0.403952
1 -0.639952 1.532362 a two -1.499979 0.403952
4 -2.031339 -0.649743 a one -1.499979 0.403952
2 0.617105 0.906281 b one 1.030287 1.163010
3 1.443470 1.419738 b two 1.030287 1.163010
Another way to do this is with transform:
In [37]: key = ['one','two','one','two','one']
In [41]: people =DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
In [42]: people
Out[42]:
a b c d e
Joe -1.264617 -1.385894 0.146627 -1.225148 0.627616
Steve 0.880528 0.530060 0.453235 1.160768 -0.053416
Wes 1.033023 -0.859791 -0.629231 -1.094454 -2.073512
Jim 1.777919 -0.864824 -1.940994 -0.806969 0.504503
Travis -1.260144 -0.486910 1.180371 -0.214743 0.629261
In [43]: people.groupby(key).mean()
Out[43]:
a b c d e
one -0.497246 -0.910865 0.232589 -0.844782 -0.272212
two 1.329224 -0.167382 -0.743879 0.176899 0.225544
In [44]: people.groupby(key).transform(np.mean)
Out[44]:
a b c d e
Joe -0.497246 -0.910865 0.232589 -0.844782 -0.272212
Steve 1.329224 -0.167382 -0.743879 0.176899 0.225544
Wes -0.497246 -0.910865 0.232589 -0.844782 -0.272212
Jim 1.329224 -0.167382 -0.743879 0.176899 0.225544
Travis -0.497246 -0.910865 0.232589 -0.844782 -0.272212
In [45]: def demean(arr):
...: return arr-arr.mean()
In [46]: demeaned = people.groupby(key).transform(demean)
In [47]: demeaned
Out[47]:
a b c d e
Joe -0.767371 -0.475029 -0.085962 -0.380366 0.899828
Steve -0.448695 0.697442 1.197114 0.983868 -0.278960
Wes 1.530269 0.051074 -0.861820 -0.249672 -1.801300
Jim 0.448695 -0.697442 -1.197114 -0.983868 0.278960
Travis -0.762898 0.423955 0.947782 0.630038 0.901473
#check that the group means of demeaned are now zero
In [48]: demeaned.groupby(key).mean()
Out[48]:
a b c d e
one 0.0 5.551115e-17 0.0 -1.480297e-16 0.0
two 0.0 0.000000e+00 0.0 0.000000e+00 0.0
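Another common group transformation is standardization (z-scoring) within each group; a minimal sketch, using the sample standard deviation as pandas does by default:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
key = ['one', 'one', 'two', 'two']

# standardize within each group: zero mean, unit (sample) std
zscore = lambda x: (x - x.mean()) / x.std()
standardized = s.groupby(key).transform(zscore)
```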
9.3.1 apply: general split-apply-combine
transform is a specialized function with rigid requirements: the passed function must either produce a scalar value that can be broadcast, or a transformed array of the same size.
apply, by contrast, splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.
Suppose we want to select the five rows with the highest tip_pct values from the tipping dataset.
#sort by the given column, then take the last n rows
In [52]: def top(df,n=5,column = 'tip_pct'):
    ...:     return df.sort_values(by=column)[-n:]
In [53]: top(tips,n=6)
Out[53]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [54]: tips.groupby('smoker').apply(top)
Out[54]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
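In the result above, the group keys become an outer level of the result's index. Passing group_keys=False suppresses that; a sketch with a made-up frame and a simplified top:

```python
import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y', 'y'], 'v': [1, 4, 2, 3]})

def top(frame, n=1, column='v'):
    return frame.sort_values(by=column)[-n:]

# group_keys=False keeps only the original row index in the result
no_keys = df.groupby('g', group_keys=False).apply(top)
```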
#pass extra arguments through apply to the function
In [55]: tips.groupby(['smoker','day']).apply(top,n=1,column = 'total_bill')
Out[55]:
total_bill tip sex smoker day time size \
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2
Sat 212 48.33 9.00 Male No Sat Dinner 4
Sun 156 48.17 5.00 Male No Sun Dinner 6
Thur 142 41.19 5.00 Male No Thur Lunch 5
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4
Sat 170 50.81 10.00 Male Yes Sat Dinner 3
Sun 182 45.35 3.50 Male Yes Sun Dinner 3
Thur 197 43.11 5.00 Female Yes Thur Lunch 4
tip_pct
smoker day
No Fri 94 0.142857
Sat 212 0.186220
Sun 156 0.103799
Thur 142 0.121389
Yes Fri 95 0.117750
Sat 170 0.196812
Sun 182 0.077178
Thur 197 0.115982
In [56]: result = tips.groupby('smoker')['tip_pct'].describe()
In [57]: result
Out[57]:
smoker
No count 151.000000
mean 0.159328
std 0.039910
min 0.056797
25% 0.136906
50% 0.155625
75% 0.185014
max 0.291990
Yes count 93.000000
mean 0.163196
std 0.085119
min 0.035638
25% 0.106771
50% 0.153846
75% 0.195059
max 0.710345
Name: tip_pct, dtype: float64
In [58]: result.unstack('smoker')
Out[58]:
smoker No Yes
count 151.000000 93.000000
mean 0.159328 0.163196
std 0.039910 0.085119
min 0.056797 0.035638
25% 0.136906 0.106771
50% 0.155625 0.153846
75% 0.185014 0.195059
max 0.291990 0.710345
In fact, the describe call above is equivalent to the following code:
f =lambda x: x.describe()
grouped.apply(f)
9.3.2 Example: filling missing values with group-specific values
Sometimes we want to fill NA values with values derived from the data itself.
In [63]: from pandas import DataFrame,Series
In [64]: s = Series(np.random.randn(6))
#set every other value to NaN
In [65]: s[::2]=np.nan
In [66]: s
Out[66]:
0 NaN
1 -1.884394
2 NaN
3 0.379894
4 NaN
5 0.588869
dtype: float64
#use fillna to fill in the mean of s
In [67]: s.fillna(s.mean())
Out[67]:
0 -0.305210
1 -1.884394
2 -0.305210
3 0.379894
4 -0.305210
5 0.588869
dtype: float64
#fill NA values with a value that varies by group
In [6]: states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
In [7]: group_key = ['East','East','East','East','West','West','West','West']
In [8]: data = Series(np.random.randn(8),index=states)
#set a few values to NA
In [9]: data[['Vermont','Nevada','Idaho']] = np.nan
In [10]: data
Out[10]:
Ohio         -1.537801
New York      0.263208
Vermont            NaN
Florida      -0.255887
Oregon        0.867263
Nevada             NaN
California    0.593747
Idaho              NaN
dtype: float64
In [11]: data.groupby(group_key).mean()
Out[11]:
East   -0.510160
West    0.730505
dtype: float64
#fill the NA values with the group means
In [12]: fill_mean = lambda g:g.fillna(g.mean())
In [13]: data.groupby(group_key).apply(fill_mean)
Out[13]:
Ohio         -1.537801
New York      0.263208
Vermont      -0.510160
Florida      -0.255887
Oregon        0.867263
Nevada        0.730505
California    0.593747
Idaho         0.730505
dtype: float64
#or fill with predefined values that vary by group; the groups have a name attribute we can use
In [14]: fill_values = {'East':0.5,'West':-1}
In [15]: fill_func = lambda g:g.fillna(fill_values[g.name])
In [16]: data.groupby(group_key).apply(fill_func)
Out[16]:
Ohio         -1.537801
New York      0.263208
Vermont       0.500000
Florida      -0.255887
Oregon        0.867263
Nevada       -1.000000
California    0.593747
Idaho        -1.000000
dtype: float64
9.3.3 Example: group weighted average and correlation
Under groupby's split-apply-combine paradigm, operations between columns of a DataFrame or between two Series, such as a group weighted average, become routine.
In [29]: df = DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
In [30]: df
Out[30]:
category data weights
0 a 0.352124 0.131472
1 a -1.340416 0.605210
2 a -0.486105 0.835266
3 a 0.172995 0.013656
4 b 0.897209 0.879197
5 b 0.955620 0.414658
6 b -0.779258 0.850658
7 b -0.193639 0.738796
In [31]: grouped = df.groupby('category')
In [34]: get_wavg = lambda g: np.average(g['data'],weights = g['weights'])
In [35]: grouped.apply(get_wavg)
Out[35]:
category
a -0.737008
b 0.131494
dtype: float64
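The same weighted-average pattern on a tiny made-up frame, where the result is easy to verify by hand ((1*1 + 3*3) / 4 = 2.5 for group a):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 3.0, 2.0, 4.0],
                   'weights': [1.0, 3.0, 1.0, 1.0]})

# weighted average of 'data' using 'weights', per group
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
wavg = df.groupby('category').apply(get_wavg)
```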
#load a dataset of stock prices from Yahoo! Finance
In [55]: close_px = pd.read_csv(r'E:\python\pydata-book-master\ch09\stock_px.csv',parse_dates = True,index_col = 0)
In [56]: close_px[-4:]
Out[56]:
AAPL MSFT XOM SPX
2011-10-11 400.29 27.00 76.27 1195.54
2011-10-12 402.19 26.96 77.16 1207.25
2011-10-13 408.43 27.18 76.37 1203.66
2011-10-14 422.00 27.27 78.11 1224.58
In [57]: close_px[:4]
Out[57]:
AAPL MSFT XOM SPX
2003-01-02 7.40 21.11 29.22 909.03
2003-01-03 7.45 21.14 29.24 908.59
2003-01-06 7.45 21.52 29.96 929.01
2003-01-07 7.43 21.93 28.95 922.93
#compute a DataFrame of yearly correlations of daily returns with SPX
In [58]: rets = close_px.pct_change().dropna()
In [59]: spx_corr = lambda x: x.corrwith(x['SPX'])
In [60]: by_year = rets.groupby(lambda x: x.year)
In [61]: by_year.apply(spx_corr)
Out[61]:
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0
In [65]: by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
Out[65]:
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
9.3.4 Example: group-wise linear regression
Define a regress function that performs an ordinary least squares (OLS) regression on each chunk of data.
In [66]: import statsmodels.api as sm
In [67]: def regress(data,yvar,xvar):
...: Y = data[yvar]
...: X = data[xvar]
...: X['intercept'] = 1
...: result = sm.OLS(Y,X).fit()
...: return result.params
...:
#compute a yearly linear regression of AAPL on SPX returns
In [68]: by_year.apply(regress,'AAPL',['SPX'])
Out[68]:
SPX intercept
2003 1.195406 0.000710
2004 1.363463 0.004201
2005 1.766415 0.003246
2006 1.645496 0.000080
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514
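If statsmodels is not available, a degree-1 np.polyfit gives the equivalent per-group OLS slope and intercept; a sketch on synthetic data with an exact linear relation, so the fitted parameters are known:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
df = pd.DataFrame({'x': x, 'y': 2.0 * x + 1.0,
                   'year': [2003] * 50 + [2004] * 50})

def regress(data, yvar, xvar):
    # least-squares fit of degree 1 returns (slope, intercept)
    slope, intercept = np.polyfit(data[xvar], data[yvar], 1)
    return pd.Series({'slope': slope, 'intercept': intercept})

params = df.groupby('year').apply(regress, 'y', 'x')
```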
9.4 Pivot Tables and Cross-Tabulation
A pivot table is a common data summarization tool found in spreadsheet programs and other data analysis software.
In [3]: tips = pd.read_csv(r"E:\python\pydata-book-master\ch08\tips.csv")
In [5]: tips[:4]
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
In [6]: tips.pivot_table(rows=['sex','smoker'])
Traceback (most recent call last):
File "<ipython-input-4-de5618d53e2d>", line 1, in <module>
tips.pivot_table(rows=['sex','smoker'])
TypeError: pivot_table() got an unexpected keyword argument 'rows'
The rows argument no longer exists in newer pandas: use index instead of rows, and write columns in full instead of cols:
In [8]: tips.pivot_table(index=['sex','smoker'])
Out[8]:
size tip total_bill
sex smoker
Female No 2.592593 2.773519 18.105185
Yes 2.242424 2.931515 17.977879
Male No 2.711340 3.113402 19.791237
Yes 2.500000 3.051167 22.284500
In [14]: tips.pivot_table(['tip_pct','size'],index = ['sex','day'],columns = 'smoker')
Out[14]:
tip_pct size
smoker No Yes No Yes
sex day
Female Fri 0.165296 0.209129 2.500000 2.000000
Sat 0.147993 0.163817 2.307692 2.200000
Sun 0.165710 0.237075 3.071429 2.500000
Thur 0.155971 0.163073 2.480000 2.428571
Male Fri 0.138005 0.144730 2.000000 2.125000
Sat 0.162132 0.139067 2.656250 2.629630
Sun 0.158291 0.173964 2.883721 2.600000
Thur 0.165706 0.164417 2.500000 2.300000
#pass margins = True to add partial totals
In [15]: tips.pivot_table(['tip_pct','size'],index = ['sex','day'],columns = 'smoker',margins = True)
Out[15]:
tip_pct size
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672
#passing len as the aggfunc gives a cross-tabulation of group sizes
In [17]: tips.pivot_table('tip_pct',index = ['sex','smoker'],columns='day',aggfunc = len,margins = True)
Out[17]:
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0
#if some combinations are empty (NA), pass fill_value = 0
In [19]: tips.pivot_table('size',index=['time','sex','smoker'],columns='day',aggfunc='sum',fill_value = 0)
Out[19]:
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23
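pivot_table here is essentially a convenience for groupby followed by unstack; a sketch with made-up data showing the equivalence:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['F', 'F', 'M', 'M'],
                   'day': ['Sun', 'Sat', 'Sun', 'Sun'],
                   'size': [2, 3, 4, 2]})

pt = df.pivot_table('size', index='sex', columns='day',
                    aggfunc='sum', fill_value=0)
# equivalent: aggregate, then pivot the inner index level into columns
gb = df.groupby(['sex', 'day'])['size'].sum().unstack(fill_value=0)
```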
9.5 Cross-Tabulation: crosstab
A cross-tabulation is a special case of a pivot table that computes group frequencies.
In [20]: pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
Out[20]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
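crosstab itself is a convenience for counting group sizes and unstacking; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'time': ['Dinner', 'Dinner', 'Lunch'],
                   'smoker': ['No', 'Yes', 'No']})

ct = pd.crosstab(df['time'], df['smoker'], margins=True)
# equivalent core counts via groupby
gb = df.groupby(['time', 'smoker']).size().unstack(fill_value=0)
```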