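WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This article shows how to filter text data in Pandas with three string methods:
- contains: the string contains a given pattern
- startswith: the string starts with a given prefix
- endswith: the string ends with a given suffix
Simulated data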
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name":["xiao ming","Xiao zhang",np.nan,"sun quan","guan yu"],
"age":["22","19","20","34","39"],
"sex":["male","Female","female","Female","male"],
"address":["廣東省深圳市","浙江省杭州市","江蘇省蘇州市","福建省泉州市","廣東省廣州市"]
})
df
df.dtypes # check the dtype of each column
name object
age object
sex object
address object
dtype: object
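The simulated data has four notable characteristics:
- name: contains a missing value (np.nan), and Xiao and xiao differ in case
- age: should normally be numeric, but is stored here as strings (object dtype)
- sex: also mixes uppercase F and lowercase f
- address: written consistently, with no irregularities
Data type conversion
We convert the age column from strings to a numeric type: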
df["age"] = df["age"].astype(float)
df
At first glance the resulting data looks the same as the original, but checking the column dtypes shows the difference:
df.dtypes
name object
age float64
sex object
address object
dtype: object
The age column is now the numeric type float64.
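As a side note beyond the original example, astype(float) raises an error if any value cannot be parsed as a number. A minimal sketch, assuming a hypothetical ages Series with one invalid entry, uses pd.to_numeric with errors="coerce" to turn unparseable values into NaN instead:

```python
import pandas as pd

# Hypothetical data (not from the article): the last entry is not a number
ages = pd.Series(["22", "19", "unknown"])

# errors="coerce" converts unparseable entries to NaN instead of raising
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(ages_numeric.tolist())
print(ages_numeric.dtype)
```

Whether coercing or failing loudly is preferable depends on how much you trust the input.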
contains
contains is a method available through the .str accessor of a Series. Its basic syntax is:
Series.str.contains(
pat,
case=True,
flags=0,
na=None,
regex=True
)
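- pat: the string or regular expression to search for
- case: whether matching is case-sensitive (True by default)
- flags: regex flags from the re module, for example re.IGNORECASE to ignore case
- na: optional scalar; the value placed in the result for missing entries. By default this is numpy.nan for object dtype and pandas.NA for StringDtype
- regex: boolean; if True, pat is treated as a regular expression; if False, pat is treated as a literal string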
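The parameters above can be illustrated with a short sketch. The sample values are hypothetical and chosen only to make the difference between regex and literal matching visible:

```python
import re
import pandas as pd

# Hypothetical sample values, not taken from the article's DataFrame
s = pd.Series(["a.b", "acb", "A.B"])

# regex=True (default): "." matches any single character
as_regex = s.str.contains("a.b")

# regex=False: "a.b" is matched literally, dot included
as_literal = s.str.contains("a.b", regex=False)

# case=False and flags=re.IGNORECASE both make the regex case-insensitive
no_case = s.str.contains("a.b", case=False)
with_flag = s.str.contains("a.b", flags=re.IGNORECASE)

print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
print(no_case.tolist())     # [True, True, True]
print(with_flag.tolist())   # [True, True, True]
```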
Default behavior
# Example 1: select rows whose name contains xiao
df["name"].str.contains("xiao")
0 True
1 False
2 NaN
3 False
4 False
Name: name, dtype: object
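When the column contains missing values, pass the na parameter so the result can be used as a filter:
Handling missing values
# Example 2: using the na parameter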
df[df["name"].str.contains("xiao",na=False)]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
</tbody>
</table>
</div>
Without na, the mask still contains NaN and using it to filter raises an error:
df[df["name"].str.contains("xiao")]
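The failure can be reproduced in isolation. This is a minimal sketch rather than the article's exact session: the column is built with dtype=object explicitly (an assumption made here so the missing value stays NaN in the mask), and the fillna variant is shown only as one possible alternative to na=False:

```python
import numpy as np
import pandas as pd

# Same names as the article; dtype=object is an explicit assumption
names = pd.DataFrame({
    "name": pd.Series(
        ["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"],
        dtype=object,
    )
})

mask = names["name"].str.contains("xiao")  # True / False / NaN

try:
    names[mask]
    raised = False
except ValueError as exc:
    # pandas refuses a boolean mask that contains NaN
    raised = True
    print("ValueError:", exc)

# Two ways to get a usable mask: fill at match time, or fill afterwards
by_na = names[names["name"].str.contains("xiao", na=False)]
by_fillna = names[mask.fillna(False).astype(bool)]

print(by_na)
```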
Ignoring case
# Example 3: using the case parameter
df["name"].str.contains("xiao",case=False)
0 True
1 True
2 NaN
3 False
4 False
Name: name, dtype: object
The match above ignores case, so two rows are True: both xiao and Xiao are matched. Adding na=False lets us use the result to filter the DataFrame:
df[df["name"].str.contains("xiao",case=False, na=False)]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
</tbody>
</table>
</div>
Ignoring case and missing values
# Example 4: ignore case and missing values
df[df["sex"].str.contains("f",case=False, na=False)]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
Using regular expressions
# Example 5: using a regular expression
df["address"].str.contains("^廣")
0 True
1 False
2 False
3 False
4 True
Name: address, dtype: bool
Here ^ marks the start of the string, so the pattern matches addresses that begin with 廣:
df[df["address"].str.contains("^廣")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
In a regular expression, $ marks the end of the string. The following selects addresses that end with 市:
df[df["address"].str.contains("市$")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
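In the regex example below, the character class [深蘇泉] matches any one of 深, 蘇, or 泉, so the filter selects addresses containing at least one of these characters: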
df[df["address"].str.contains("[深蘇泉]")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
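A character class only matches single characters. To match any of several multi-character keywords, regex alternation with | is one option. The sketch below is an illustration, not part of the original examples; the cities list is hypothetical, and re.escape is used so that keywords are treated literally:

```python
import re
import pandas as pd

address = pd.Series(
    ["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市", "福建省泉州市", "廣東省廣州市"]
)

# Hypothetical keyword list; re.escape guards against regex metacharacters
cities = ["深圳", "杭州"]
pattern = "|".join(re.escape(city) for city in cities)

matched = address[address.str.contains(pattern)]
print(matched.tolist())  # ['廣東省深圳市', '浙江省杭州市']
```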
startswith
The syntax of startswith is simpler:
Series.str.startswith(pat, na=None)
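- pat: the literal prefix to test for (a string of one or more characters); note that regular expressions are not accepted
- na: how missing values are handled; na=False makes missing values evaluate to False, so those rows are excluded by the filter
The pat parameter
Pass a literal string; regular expressions are not accepted: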
df["address"].str.startswith("廣")
0 True
1 False
2 False
3 False
4 True
Name: address, dtype: bool
df[df["address"].str.startswith("廣")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
This gives the same result as the regex form that anchors the pattern at the start of the string:
df[df["address"].str.contains("^廣")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
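The equivalence, and the fact that startswith does not interpret regex syntax, can be checked directly. A minimal sketch using only the address values from the article:

```python
import pandas as pd

address = pd.Series(
    ["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市", "福建省泉州市", "廣東省廣州市"]
)

# startswith treats "^" as an ordinary character, so nothing matches
literal_caret = address.str.startswith("^廣")

# The prefix form and the anchored regex select the same rows
by_prefix = address.str.startswith("廣")
by_regex = address.str.contains("^廣")

print(literal_caret.tolist())  # [False, False, False, False, False]
print(by_prefix.tolist())      # [True, False, False, False, True]
print(by_regex.tolist())       # [True, False, False, False, True]
```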
Case sensitivity
startswith is always case-sensitive; unlike contains, it has no case parameter:
df[df["sex"].str.startswith("f")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
</tbody>
</table>
</div>
df[df["sex"].str.startswith("F")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
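Since startswith cannot ignore case, a case-insensitive prefix test needs a workaround. Two possible approaches, sketched here as suggestions rather than as the article's method, are to lowercase the values first or to use contains with an anchored pattern and case=False:

```python
import pandas as pd

sex = pd.Series(["male", "Female", "female", "Female", "male"])

# Option 1: normalize the case, then test the prefix
lowered = sex.str.lower().str.startswith("f")

# Option 2: anchored regex with case-insensitive matching
anchored = sex.str.contains("^f", case=False)

print(lowered.tolist())   # [False, True, True, True, False]
print(anchored.tolist())  # [False, True, True, True, False]
```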
Handling missing values
df["name"].str.startswith("xiao")
0 True
1 False
2 NaN
3 False
4 False
Name: name, dtype: object
df[df["name"].str.startswith("xiao",na=False)]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
</tbody>
</table>
</div>
endswith
endswith tests whether each string ends with a given suffix. Its syntax is:
Series.str.endswith(pat, na=None)
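- pat: the literal suffix to test for (a string of one or more characters); note that regular expressions are not accepted
- na: how missing values are handled; na=False makes missing values evaluate to False, so those rows are excluded by the filter
The pat parameter
# ends with 市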
df[df["address"].str.endswith("市")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
# regex equivalent using the contains method
df[df["address"].str.contains("市$")]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
Handling missing values
df["name"].str.endswith("g")
0 True
1 True
2 NaN
3 False
4 False
Name: name, dtype: object
df[df["name"].str.endswith("g",na=False)]
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
</tbody>
</table>
</div>
# without the na parameter, an error is raised
df[df["name"].str.endswith("g")]
The cause of the error is clear: the name column contains a missing value, so the mask contains NaN. Passing the na parameter resolves it.
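Once each mask is strictly boolean, masks from different columns can be combined with & and |. The following is a small sketch built on the article's data; the particular combination of conditions is only an illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"],
    "address": ["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市", "福建省泉州市", "廣東省廣州市"],
})

# na=False keeps the mask free of NaN so it can be combined safely
name_ends_g = df["name"].str.endswith("g", na=False)
in_guangdong = df["address"].str.startswith("廣東")

both = df[name_ends_g & in_guangdong]    # both conditions hold
either = df[name_ends_g | in_guangdong]  # at least one condition holds

print(both.index.tolist())    # [0]
print(either.index.tolist())  # [0, 1, 4]
```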