Pandas 2.x

呆鳥 says: according to the announcement from the Pandas development team, Pandas enters the 2.x era after March. Python data analysts, time to jump in!

Details are available here: https://pandas.pydata.org/docs/dev/whatsnew/index.html

Release Note

Main improvements

A configurable option, mode.dtype_backend, that returns pyarrow-backed data types
Optional dependencies can be installed with pip
Index supports NumPy numeric dtypes
A Copy-on-Write mechanism that improves performance by deferring copies


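The details are as follows:

1. pyarrow data types

Support for Apache Arrow is the biggest change in Pandas 2.x. First, a short introduction to Arrow.

Arrow is an in-memory analytics development platform backed by the Apache Software Foundation. It processes and moves large volumes of data quickly, and it defines a standardized, language-independent columnar memory format for flat and hierarchical data, so that analytic operations run more efficiently at the hardware level.

pyarrow is the Arrow library for the Python community. It integrates closely with NumPy and Pandas, and starting with version 2.0 Pandas adds dedicated support for pyarrow data types.

With pyarrow, Pandas can operate on data faster and use memory more efficiently. The advantage is most visible on very large datasets.

The following is the description of Arrow support from the Pandas 2.0 development release notes.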

Previously, functions such as read_csv(), read_excel(), read_json(), read_sql() and to_numeric() accepted a use_nullable_dtypes keyword argument that makes them convert to nullable data types automatically. To simplify this, Pandas added a nullable_dtypes option that sets the keyword to True globally whenever it is not specified explicitly. Enable the option as follows:

pd.options.mode.nullable_dtypes = True

This option only affects the use_nullable_dtypes keyword of these functions.

Pandas also added a new global configuration option, mode.dtype_backend, which works together with use_nullable_dtypes=True in read_csv() and the other functions above to select which nullable data types are returned.

DataFrame.convert_dtypes() and Series.convert_dtypes() also honor the mode.dtype_backend option.

The default value of mode.dtype_backend is pandas, which returns NumPy-backed nullable data types. It can now also be set to pyarrow, which returns pyarrow-backed nullable data types, i.e. ArrowDtype.

Example:

In [13]: import io

In [14]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....: 

In [15]: with pd.option_context("mode.dtype_backend", "pandas"):
   ....:     df = pd.read_csv(data, use_nullable_dtypes=True)
   ....: 

In [16]: df.dtypes
Out[16]: 
a             Int64
b           Float64
c           boolean
d    string[python]
e             Int64
f           Float64
g           boolean
h    string[python]
i             Int64
dtype: object

In [17]: data.seek(0)
Out[17]: 0

# The line below is the one to focus on
In [18]: with pd.option_context("mode.dtype_backend", "pyarrow"):
   ....:     df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
   ....: 

In [19]: df_pyarrow.dtypes
Out[19]: 
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
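
The transcript above follows the development release notes. In the released Pandas 2.0, the reader functions take a dtype_backend keyword ("numpy_nullable" or "pyarrow") rather than relying on the mode.dtype_backend option. A minimal sketch, assuming Pandas 2.0 or later; the sample data and variable names are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical sample data: column b has one missing value.
csv_text = "a,b,c\n1,2.5,True\n3,,False\n"

# dtype_backend="numpy_nullable" asks for NumPy-backed nullable dtypes;
# "pyarrow" would return ArrowDtype columns and needs pyarrow installed.
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(df_nullable.dtypes)
print(df_nullable["b"].isna().sum())  # number of missing values in column b
```

The missing value in column b is kept as a nullable missing value instead of forcing the column to a plain NumPy float dtype with NaN.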

2. Installing optional dependencies with pip

When installing Pandas with pip, you can specify which sets of optional dependencies to install as extras.

pip install "pandas[performance, aws]>=2.0.0"

3. `Index` supports NumPy numeric dtypes

Starting with Pandas 2.0, an Index can hold any NumPy numeric dtype. Previously only int64, uint64 and float64 were supported. From 2.0 on, Pandas supports all NumPy numeric dtypes, such as int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32 and float64.

Example:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Index([1, 2, 3], dtype='int8')

In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: Index([1, 2, 3], dtype='uint16')

In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')
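
One practical effect of narrower index dtypes is a smaller memory footprint. The sketch below compares the same values stored as int8 and int64; it assumes Pandas 2.0 or later (older versions would upcast the first index to int64), and the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

# Same values stored with two different integer widths.
idx_small = pd.Index([1, 2, 3], dtype=np.int8)
idx_wide = pd.Index([1, 2, 3], dtype=np.int64)

# nbytes reports the size of the underlying values in bytes.
print(idx_small.dtype, idx_small.nbytes)
print(idx_wide.dtype, idx_wide.nbytes)
```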

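4. Better performance with Copy-on-Write

A lazy-copy mechanism has been added to the methods below: the copy is deferred until one of the objects involved is modified. With Copy-on-Write enabled, these methods return views instead of copies, which is a significant performance improvement over the regular behavior.

(Only some of the methods that support this mechanism are listed here; see the documentation for the full list.)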
* DataFrame.reset_index() / Series.reset_index()
* DataFrame.set_index()
* DataFrame.reindex() / Series.reindex()
* DataFrame.reindex_like() / Series.reindex_like()
* DataFrame.drop()
* DataFrame.dropna() / Series.dropna()
* DataFrame.select_dtypes()
* DataFrame.align() / Series.align()
* Series.to_frame()
* DataFrame.rename() / Series.rename()
* DataFrame.add_prefix() / Series.add_prefix()
* DataFrame.add_suffix() / Series.add_suffix()
* DataFrame.drop_duplicates() / Series.drop_duplicates()
* DataFrame.filter() / Series.filter()
* DataFrame.head() / Series.head()
* DataFrame.tail() / Series.tail()
* DataFrame.pop() / Series.pop()
* DataFrame.replace() / Series.replace()
* DataFrame.shift() / Series.shift()
* DataFrame.sort_index() / Series.sort_index()
* DataFrame.sort_values() / Series.sort_values()
* DataFrame.truncate()
* DataFrame.iterrows()
* DataFrame.fillna() / Series.fillna()
* DataFrame.where() / Series.where()
* DataFrame.astype() / Series.astype()
* concat()
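
The lazy copy can be observed directly. This is a minimal sketch, assuming a Pandas version that has the mode.copy_on_write option; it uses np.shares_memory as an informal probe, and the names shares_before and shares_after are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

with pd.option_context("mode.copy_on_write", True):
    # reset_index() is one of the methods listed above; with Copy-on-Write
    # the result is expected to share its data with df until a write happens.
    df2 = df.reset_index(drop=True)
    shares_before = np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())

    # The first write to df2 triggers the real copy.
    df2.iloc[0, 0] = 100
    shares_after = np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())

print(shares_before, shares_after)
print(df["a"].tolist())   # the original keeps its values
print(df2["a"].tolist())  # only the derived frame changed
```

Whatever the sharing probe reports, the visible behavior is that writing to df2 never changes df.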

* Accessing a single column of a DataFrame as a Series (for example, df["col"]) returns a new object each time it is constructed when Copy-on-Write is enabled; the same Series object is no longer returned repeatedly.
* When a Series is built from an existing Series with the default copy=False, the Series constructor uses a lazy copy: the copy is deferred until the data is modified.
* When a DataFrame is built from an existing DataFrame with the default copy=False, the DataFrame constructor also uses a lazy copy.
* When a DataFrame is built from a dictionary of Series with the default copy=False, a lazy copy is used as well.
* With Copy-on-Write enabled, setting values through chained assignment (for example, df["a"][1:3] = 0) raises a ChainedAssignmentError. Chained assignment can never work in this mode.
* DataFrame.replace() respects Copy-on-Write when inplace=True.
* DataFrame.transpose() respects Copy-on-Write.
* Arithmetic operations such as ser *= 2 also respect Copy-on-Write.

Enable the option as follows:

# Option 1
pd.set_option("mode.copy_on_write", True)

# Option 2
pd.options.mode.copy_on_write = True

# Enable it locally
with pd.option_context("mode.copy_on_write", True):
    ...
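
Since chained assignment cannot work under Copy-on-Write, the usual replacement is a single .loc call that selects rows and column together. A minimal sketch, with a made-up DataFrame for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

with pd.option_context("mode.copy_on_write", True):
    # Instead of the chained form df["a"][1:3] = 0, select rows and
    # column in one .loc call. .loc slices by label and includes the end,
    # so 1:2 covers the same two rows as the positional slice 1:3.
    df.loc[1:2, "a"] = 0

print(df["a"].tolist())  # → [1, 0, 0]
```

Because the assignment goes through one indexing step on df itself, there is no intermediate object whose modification could be lost.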
