January 2019 – A Data Analyst's Daily Life

A line I read in Think Python a while ago has stuck with me. Roughly: sometimes Python confuses us because there are too many ways to achieve an effect, not too few.

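That is certainly true of data selection on a Pandas DataFrame. List indexing is already hard enough to understand (for background, see "Understanding Python Indexing and Slicing in One Picture"), and DataFrame adds several selection methods on top of it.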

Enough preamble. Here are the results.

1. loc: select data by label, that is, by the values of index and columns. loc takes two arguments, which control row and column selection in that order.

# Sample dataset
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'), index=list('defg'))

df
Out[189]: 
   a   b   c
d  0   1   2
e  3   4   5
f  6   7   8
g  9  10  11

# Select a single row by label
df.loc['d']
Out[190]: 
a    0
b    1
c    2
Name: d, dtype: int32

# Select multiple rows
df.loc[['d','e']]
Out[191]: 
   a  b  c
d  0  1  2
e  3  4  5

# Select multiple columns (a label slice includes its end label)
df.loc[:,:'b']
Out[193]: 
   a   b
d  0   1
e  3   4
f  6   7
g  9  10

# A label that is not in the index or columns raises an error. Here 'a' is a column label, but the first argument of loc selects rows.
df.loc['a']
Traceback (most recent call last):
……
KeyError: 'the label [a] is not in the [index]'
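The examples above pass only one argument at a time. As a minimal sketch (the variable names are my own, and it assumes a reasonably recent pandas), the following combines both arguments of loc, shows that a label slice includes its end label, and catches the KeyError raised for a missing row label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('abc'), index=list('defg'))

# Both arguments together: rows 'e' through 'f', columns 'a' through 'b'.
# Unlike Python list slices, label slices include the end label.
block = df.loc['e':'f', 'a':'b']
print(block)

# A single row label plus a single column label returns a scalar.
value = df.loc['g', 'c']
print(value)

# 'a' is a column label, not a row label, so this raises KeyError.
try:
    df.loc['a']
    missing_row = False
except KeyError:
    missing_row = True
print(missing_row)
```

Catching KeyError is only for illustration; in real code it is usually simpler to check membership first with `'a' in df.index`.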

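2. iloc: select data by integer position, that is, by the zero-based position of rows and columns. iloc also takes two arguments, which control row and column selection in that order.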

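Note: positions and index labels are different things. After filtering, positions are recounted from the new DataFrame, while index labels stay unchanged.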

df
Out[196]: 
   a   b   c
d  0   1   2
e  3   4   5
f  6   7   8
g  9  10  11

# Select one row
df.iloc[0]
Out[197]: 
a    0
b    1
c    2
Name: d, dtype: int32

# Select multiple rows
df.iloc[0:2]
Out[198]: 
   a  b  c
d  0  1  2
e  3  4  5

# Select one or more columns
df.iloc[:,2:3]
Out[199]: 
    c
d   2
e   5
f   8
g  11
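The difference between positions and labels noted at the start of this section is easiest to see after filtering. A small sketch, with a filter condition chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('abc'), index=list('defg'))

# Keep only rows where column 'a' is greater than 0: rows e, f, g remain.
filtered = df[df['a'] > 0]

# Position 0 now refers to row 'e', because positions are recounted.
first_by_position = filtered.iloc[0]
print(first_by_position.name)

# The label 'e' still refers to the same row as before filtering.
same_row = filtered.loc['e'].equals(df.loc['e'])
print(same_row)

# iloc slices exclude the end position, like Python list slices.
sub = df.iloc[1:3, 0:2]
print(sub)
```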

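3. ix: mixed indexing, which selects data by both label and integer position. ix also takes two arguments, which control row and column selection in that order. (ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the examples in this section run only on older versions.)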

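Note: within each of the two arguments of ix, use either labels only or positions only. Mixing them inside one argument returns NaN for part of the result.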

df
Out[200]: 
   a   b   c
d  0   1   2
e  3   4   5
f  6   7   8
g  9  10  11

# Select one row
df.ix[1]
Out[201]: 
a    3
b    4
c    5
Name: e, dtype: int32

# Incorrect mixed indexing (the intent was to select the first row and row e)
df.ix[[0,'e']]
Out[202]: 
     a    b    c
0  NaN  NaN  NaN
e  3.0  4.0  5.0

# Select a region (rows from e onward, first two columns)
df.ix['e':,:2]
Out[203]: 
   a   b
e  3   4
f  6   7
g  9  10
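Since ix is not available in pandas 1.0 and later, the same selections have to be written with loc or iloc. The following is one possible sketch rather than an official migration recipe: it converts positions to labels with `df.index[...]` and `df.columns[...]`, or a label to a position with `Index.get_loc`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('abc'), index=list('defg'))

# "First row and row e": convert position 0 to its label, then use loc.
rows = df.loc[[df.index[0], 'e']]
print(rows)

# "Rows from e onward, first two columns": convert the column positions
# to labels so that loc can handle both axes.
region = df.loc['e':, df.columns[:2]]
print(region)

# The same region via iloc: convert the row label to a position instead.
start = df.index.get_loc('e')
region_by_position = df.iloc[start:, :2]
print(region_by_position.equals(region))
```

Converting to one kind of key before indexing keeps each call unambiguous, which is the problem the mixed list `[0, 'e']` ran into.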

4. at/iat: get the single value at a specific position, by label (at) or by integer position (iat).

df
Out[204]: 
   a   b   c
d  0   1   2
e  3   4   5
f  6   7   8
g  9  10  11

# Get the value in the 2nd row, 3rd column
df.iat[1,2]
Out[205]: 5

# Get the value at row f, column a
df.at['f','a']
Out[206]: 6
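As a short sketch of how at/iat relate to loc/iloc: for a single cell they return the same scalar, and they can also assign to that cell. The replacement values 60 and 50 below are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('abc'), index=list('defg'))

# at and loc return the same scalar for a (row label, column label) pair.
by_at = df.at['f', 'a']
by_loc = df.loc['f', 'a']
print(by_at, by_loc)

# iat and iloc return the same scalar for a (row position, column position) pair.
by_iat = df.iat[1, 2]
by_iloc = df.iloc[1, 2]
print(by_iat, by_iloc)

# at and iat can also assign a single value in place.
df.at['f', 'a'] = 60
df.iat[1, 2] = 50
print(df)
```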

5. Direct indexing: df[]

df
Out[208]: 
   a   b   c
d  0   1   2
e  3   4   5
f  6   7   8
g  9  10  11

# Select rows
df[0:3]
Out[209]: 
   a  b  c
d  0  1  2
e  3  4  5
f  6  7  8

# Select a column
df['a']
Out[210]: 
d    0
e    3
f    6
g    9
Name: a, dtype: int32

# Select multiple columns
df[['a','c']]
Out[211]: 
   a   c
d  0   2
e  3   5
f  6   8
g  9  11

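# Integer and label slices apply only to rows. (The intent was to select column c,
# but this returns all of df: a slice applies only to rows,
# and d, e, f, g all sort after c, so every row is selected.)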
df['c':]
Out[212]: 
   a   b   c
d  0   1   2
e  3   4   5
f  6   7   8
g  9  10  11
df['f':]
Out[213]: 
   a   b   c
f  6   7   8
g  9  10  11

# Select a column with attribute access (df.name)
df.a
Out[214]: 
d    0
e    3
f    6
g    9
Name: a, dtype: int32
# Attribute access cannot select rows
df.f
Traceback (most recent call last):
  File "<ipython-input-215-6438703abe20>", line 1, in <module>
    df.f
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'f'
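The behavior of df[] in this section depends on the type of the key. The sketch below puts the cases side by side. The boolean-mask case is my addition and does not appear in the examples above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('abc'), index=list('defg'))

# A single label or a list of labels is looked up in the columns.
col = df['a']             # Series
cols = df[['a', 'c']]     # DataFrame with two columns

# A slice, by position or by label, is applied to the rows.
rows_by_position = df[0:3]
rows_by_label = df['f':]

# A boolean Series also filters rows.
rows_by_mask = df[df['b'] > 4]

# A row label used as a plain key is treated as a column name and fails.
try:
    df['f']
    row_key_failed = False
except KeyError:
    row_key_failed = True

print(list(cols.columns), list(rows_by_position.index),
      list(rows_by_label.index), list(rows_by_mask.index), row_key_failed)
```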

6. Summary

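1) With only the first argument, as in .loc[['d','e']], .iloc[2:3], or .ix[2], the methods .loc, .iloc, and .ix select rows.
2) .loc and .at accept only labels when selecting columns, not positions.
3) .iloc and .iat accept only positions when selecting columns, not labels.
4) df[] selects either rows or columns, never both at once. Column selection takes column names only; integer and label slices select rows only. When a label appears in both index and columns, a single label or a list of labels still selects columns, and a slice still selects rows. Attribute access (df.name) selects columns only, never rows.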
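A quick way to check what df[] does when a label appears in both index and columns is to build such a frame and try it. The frame `df2` below is hypothetical and exists only for this check; the behavior shown is what I would expect from current pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where 'a', 'b', 'c' are both row labels and column labels.
df2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=list('abc'), index=list('abc'))

# Scalar key: looked up in the columns, so this is column 'a'.
scalar_result = df2['a']
print(scalar_result.tolist())

# Slice key: applied to the rows, so this is rows 'a' through 'b'.
slice_result = df2['a':'b']
print(slice_result.index.tolist())

# loc makes the axis explicit, which avoids the ambiguity.
row_a = df2.loc['a']
col_a = df2.loc[:, 'a']
print(row_a.tolist(), col_a.tolist())
```

When labels overlap like this, spelling out the axis with loc or iloc is usually easier to read than relying on how df[] dispatches.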

Archived January 3, 2019

A Brief Look at loc, iloc, ix, and at/iat on a Pandas DataFrame