"上次有讲过 numpy 的基础知识，这次来讲讲 Pandas 的基础知识。 [链接]Series s1 Out[16]: 0 a 1 b ...."

luwenjun

RPA member No. 179
Third-party library usage • 3 replies • 648 views • 2019-03-21 09:49:43

Pandas Library Basics

Last time we covered the basics of numpy; this time we look at the basics of Pandas.

Series

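from pandas import Series, DataFrame
import numpy as np
s1 = Series(['a','b','c','d'])
s1
Out[16]: 
0    a
1    b
2    c
3    d
dtype: object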

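When no index is specified, a default integer index running from 0 to N-1 is generated.

Series infers its dtype from the elements of the list passed in: if they are all integers, the Series is int64; if any element is a float, the Series is float64; if any element is a string, the Series has dtype object.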

s1 = Series([1,2,3,4])
s1
Out[23]: 
0    1
1    2
2    3
3    4
dtype: int64
s2 = Series([1,2,3,4.0])
s2
Out[25]: 
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
s3 = Series([1,2,3,'4'])
s3
Out[27]: 
0    1
1    2
2    3
3    4
dtype: object

Besides a list, a Series can also be created from a dict.

s1 = Series({'a':1,'b':2,'c':3,'d':4})
s1
Out[37]: 
a    1
b    2
c    3
d    4
dtype: int64

When a Series is created from a dict, the dict's keys become the Series index and the dict's values become the Series values.

Series also accepts a dict together with a list of index labels:

dict1 = {'a':1,'b':2,'c':3,'d':4}
index1 = ['a','b','e']
s1 = Series(dict1,index=index1)
s1
Out[51]: 
a    1.0
b    2.0
e    NaN
dtype: float64

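In the code above, the index of s1 is set to index1, i.e. 'a', 'b' and 'e'. The values of s1 are the dict1 values whose keys match the labels in index1; a label with no match gets NaN. For example, 'e' matches no key in dict1, so its value is NaN, while 'a' and 'b' both match and get 1 and 2. The dict1 keys that are not listed in index1 ('c' and 'd') are left out, and because NaN is a float the dtype becomes float64.

Accessing Series values by index: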
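
To make the matching behaviour easier to check, here is a minimal sketch. The names prices, wanted and clean are made up for illustration, and isna() and dropna() are just one way to detect and discard the unmatched labels, not something the original example uses.

```python
import pandas as pd

prices = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
wanted = ['a', 'b', 'e']

# Passing index= keeps the matching keys and fills the rest with NaN.
s = pd.Series(prices, index=wanted)

# NaN is a float, so the whole Series is upcast to float64.
print(s.dtype)            # float64
print(s.isna().tolist())  # [False, False, True]

# One possible follow-up: drop the unmatched label and restore integers.
clean = s.dropna().astype('int64')
print(clean.index.tolist())  # ['a', 'b']
print(clean.tolist())        # [1, 2]
```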

s1 = Series({'a':1,'b':2,'c':3,'d':4})
s1
Out[39]: 
a    1
b    2
c    3
d    4
dtype: int64
s1['b']
Out[40]: 2

In the code above, s1['b'] returns the value stored under the index label b.

Series supports logical and arithmetic operations:
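
Beyond plain square brackets, pandas also offers explicit accessors. The sketch below is an addition to the post: .loc, .iloc and get() are not used in the original, and the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

by_label = s.loc['b']        # label-based lookup
by_position = s.iloc[1]      # position-based lookup (0-based)
several = s.loc[['a', 'd']]  # a list of labels returns a Series
default = s.get('z', 0)      # get() returns a fallback instead of raising KeyError

print(by_label, by_position, several.tolist(), default)  # 2 2 [1, 4] 0
```

Using .loc and .iloc makes it explicit whether a lookup is by label or by position, which matters once the index itself contains integers.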

s1 = Series([2,5,-10,200])
s1 * 2
Out[53]: 
0      4
1     10
2    -20
3    400
dtype: int64
s1[s1>0]
Out[54]: 
0      2
1      5
3    200
dtype: int64

An arithmetic operation on a Series is applied to every element of the Series.

A comparison on a Series returns a new Series of bool values, one per element; the original Series is left unchanged.

s1 = Series([2,5,-10,200])
s1
Out[10]: 
0      2
1      5
2    -10
3    200
dtype: int64
s1 > 0
Out[11]: 
0     True
1     True
2    False
3     True
dtype: bool

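A boolean Series like this can be used to filter out data that does not meet a condition, for example dropping the elements in the example above that are not greater than 0: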

s1 = Series([2,5,-10,200])
s1[s1 >0]
Out[23]: 
0      2
1      5
3    200
dtype: int64
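
Conditions can also be combined. This is a small sketch that goes slightly beyond the post; the bounds 0 and 100 and the names in_range and outside are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([2, 5, -10, 200])

# Each comparison needs its own parentheses; combine with & (and), | (or), ~ (not).
in_range = s[(s > 0) & (s < 100)]
outside = s[~((s > 0) & (s < 100))]

print(in_range.tolist())        # [2, 5]
print(outside.tolist())         # [-10, 200]
print(in_range.index.tolist())  # [0, 1]  (original labels are kept)
```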

Both the Series object and its index have a name attribute, which can be set as follows:

fruit = {0:'apple',1:'orange',2:'banana'} 
fruitSeries = Series(fruit)
fruitSeries.name='Fruit'
fruitSeries
Out[27]: 
0     apple
1    orange
2    banana
Name: Fruit, dtype: object
fruitSeries.index.name='Fruit Index'
fruitSeries
Out[29]: 
Fruit Index
0     apple
1    orange
2    banana
Name: Fruit, dtype: object

The index of a Series can be changed directly by assigning to its index attribute:

fruitSeries.index=['a','b','c']
fruitSeries
Out[31]: 
a     apple
b    orange
c    banana
Name: Fruit, dtype: object

DataFrame

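A DataFrame is a tabular data structure, much like a table in a relational database: it is made up of rows and columns and has column names, an index, and so on.

You can think of each column of a DataFrame as one of the Series described above: there are as many Series as there are columns, and they all share the same index.

Creating a DataFrame from a dict: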

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],
'year':[2010,2011,2012,2011,2012],
'sale':[15000,17000,36000,24000,29000]}
frame = DataFrame(data)
frame
Out[12]: 
    fruit  year   sale
0   Apple  2010  15000
1   Apple  2011  17000
2  Orange  2012  36000
3  Orange  2011  24000
4  Banana  2012  29000

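When a DataFrame is created this way, every value in the dict should be a list (or another array-like), and all of them must have the same length; otherwise an error is raised. Here, the lists under the keys fruit, year and sale must all be the same length.

As with a Series, a default index is added automatically when a DataFrame is created.

The columns argument specifies the order of the columns: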
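
The length check can be seen with a deliberately broken dict. This is only a sketch: the dict named bad is made up for illustration, and since the exact wording of the message differs between pandas versions it is printed rather than asserted.

```python
import pandas as pd

bad = {'fruit': ['Apple', 'Orange', 'Banana'],
       'year': [2010, 2011]}   # one element short

error = None
try:
    pd.DataFrame(bad)
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length.
    error = str(exc)
    print('ValueError:', error)
```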

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],
'year':[2010,2011,2012,2011,2012],
'sale':[15000,17000,36000,24000,29000]}
frame = DataFrame(data,columns=['sale','fruit','year','price'])
frame
Out[25]: 
    sale   fruit  year price
0  15000   Apple  2010   NaN
1  17000   Apple  2011   NaN
2  36000  Orange  2012   NaN
3  24000  Orange  2011   NaN
4  29000  Banana  2012   NaN

If a requested column is not found in the data, it is filled with NaN values.

The index of a DataFrame can be set as well, again by passing a list:

frame = DataFrame(data,columns=['sale','fruit','year'],index=[4,3,2,1,0])
frame
Out[22]: 
    sale   fruit  year
4  15000   Apple  2010
3  17000   Apple  2011
2  36000  Orange  2012
1  24000  Orange  2011
0  29000  Banana  2012

Passing [4,3,2,1,0] changes the index from the default 0,1,2,3,4 to 4,3,2,1,0; the rows themselves stay in the same order.

Getting a Series from a DataFrame:
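
Because index= only relabels the rows, it is easy to confuse with reordering. The sketch below contrasts it with reindex(), which the post does not use; the two-row data and the names labelled and reordered are illustrative.

```python
import pandas as pd

data = {'fruit': ['Apple', 'Banana'], 'sale': [15000, 29000]}

# index= at construction time labels the rows in their existing order.
labelled = pd.DataFrame(data, index=[1, 0])

# reindex() on an existing frame moves rows to match the requested labels.
reordered = pd.DataFrame(data).reindex([1, 0])

print(labelled['fruit'].tolist())    # ['Apple', 'Banana']
print(reordered['fruit'].tolist())   # ['Banana', 'Apple']
```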

frame['year']
Out[26]: 
0    2010
1    2011
2    2012
3    2011
4    2012
Name: year, dtype: int64
frame['fruit']
Out[27]: 
0     Apple
1     Apple
2    Orange
3    Orange
4    Banana
Name: fruit, dtype: object

Both frame['fruit'] and frame.fruit select a column, and what they return is a Series object.

Assigning to a DataFrame means assigning to a column: select the column by name and assign to it to change its values:
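
The two forms are not always interchangeable. In the sketch below the column names count and unit price are chosen, purely for illustration, to show cases where only the bracket form reaches the column.

```python
import pandas as pd

frame = pd.DataFrame({'fruit': ['Apple', 'Orange'],
                      'count': [3, 5],
                      'unit price': [1.5, 2.0]})

# For an ordinary column name both forms return the same Series.
same = frame['fruit'].equals(frame.fruit)
print(same)                        # True

# 'count' is also a DataFrame method, so attribute access returns the method.
print(callable(frame.count))       # True
print(frame['count'].tolist())     # [3, 5]

# A name containing a space is not a valid attribute; brackets still work.
print(frame['unit price'].tolist())  # [1.5, 2.0]
```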

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],
'year':[2010,2011,2012,2011,2012],
'sale':[15000,17000,36000,24000,29000]}
frame = DataFrame(data,columns=['sale','fruit','year','price'])
frame
Out[24]: 
    sale   fruit  year price
0  15000   Apple  2010   NaN
1  17000   Apple  2011   NaN
2  36000  Orange  2012   NaN
3  24000  Orange  2011   NaN
4  29000  Banana  2012   NaN
frame['price'] = 20
frame
Out[26]: 
    sale   fruit  year  price
0  15000   Apple  2010     20
1  17000   Apple  2011     20
2  36000  Orange  2012     20
3  24000  Orange  2011     20
4  29000  Banana  2012     20
frame.price = 40
frame
Out[28]: 
    sale   fruit  year  price
0  15000   Apple  2010     40
1  17000   Apple  2011     40
2  36000  Orange  2012     40
3  24000  Orange  2011     40
4  29000  Banana  2012     40
frame.price=np.arange(5)
frame
Out[30]: 
    sale   fruit  year  price
0  15000   Apple  2010      0
1  17000   Apple  2011      1
2  36000  Orange  2012      2
3  24000  Orange  2011      3
4  29000  Banana  2012      4

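frame['price'] or frame.price selects the price column, and frame['price']=20 or frame.price=20 then sets every value in the price column to 20.

An array produced by numpy's arange function can also be assigned, as the code above shows.

A Series can also be assigned to a DataFrame column: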

data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],
'year':[2010,2011,2012,2011,2012],
'sale':[15000,17000,36000,24000,29000]}
frame = DataFrame(data,columns=['sale','fruit','year','price'])
frame
Out[6]: 
    sale   fruit  year price
0  15000   Apple  2010   NaN
1  17000   Apple  2011   NaN
2  36000  Orange  2012   NaN
3  24000  Orange  2011   NaN
4  29000  Banana  2012   NaN
priceSeries = Series([3.4,4.2,2.4],index = [1,2,4])
frame.price = priceSeries
frame
Out[9]: 
    sale   fruit  year  price
0  15000   Apple  2010    NaN
1  17000   Apple  2011    3.4
2  36000  Orange  2012    4.2
3  24000  Orange  2011    NaN
4  29000  Banana  2012    2.4

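With this kind of assignment the DataFrame index and the Series index are aligned automatically: each value lands at the matching index label, and positions with no match are filled with the missing value NaN.

The result of the assignment when the Series is created without an explicit index: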

priceSeries = Series([3.4,4.2,2.4])
frame.price = priceSeries
frame
Out[12]: 
    sale   fruit  year  price
0  15000   Apple  2010    3.4
1  17000   Apple  2011    4.2
2  36000  Orange  2012    2.4
3  24000  Orange  2011    NaN
4  29000  Banana  2012    NaN
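
The alignment is by label, not by position, which is easier to see with a non-default index. This is a sketch with made-up labels ('x', 'y', 'z', 'q'); to_numpy() is shown as one way to opt out of alignment and is not part of the original post.

```python
import pandas as pd

frame = pd.DataFrame({'fruit': ['Apple', 'Orange', 'Banana']},
                     index=['x', 'y', 'z'])

# Labels 'z' and 'x' match; 'y' gets no value and 'q' is not in the frame.
price = pd.Series([7.3, 3.4, 9.9], index=['z', 'x', 'q'])
frame['price'] = price
print(frame['price'].tolist())        # [3.4, nan, 7.3]

# Converting to an array drops the labels, so values are assigned by position
# (the length must then equal the number of rows).
frame['by_position'] = price.to_numpy()
print(frame['by_position'].tolist())  # [7.3, 3.4, 9.9]
```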

A DataFrame column can also be assigned from a list or an array, but its length must equal the number of rows in the DataFrame:

priceList=[3.4,2.4,4.6,3.8,7.3]
frame.price=priceList
frame
Out[15]: 
    sale   fruit  year  price
0  15000   Apple  2010    3.4
1  17000   Apple  2011    2.4
2  36000  Orange  2012    4.6
3  24000  Orange  2011    3.8
4  29000  Banana  2012    7.3

Assigning to a column that does not exist creates a new column:

frame['total'] = 30000
frame
Out[45]: 
    sale   fruit  year  price  total
0  15000   Apple  2010    3.4  30000
1  17000   Apple  2011    2.4  30000
2  36000  Orange  2012    4.6  30000
3  24000  Orange  2011    3.8  30000
4  29000  Banana  2012    7.3  30000

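The example above adds a new column total by assigning to a column that does not exist. Use frame['total'] for this, not frame.total. Assigning through frame.total when no such column exists does not create a column: it sets an ordinary Python attribute named total on the frame object. The value can be read back through frame.total, but it does not show up when the DataFrame is printed because it is not part of the data. The code below assigns to total first with frame.total and then with frame['total'], starting from a frame that has no total column: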

frame
Out[60]: 
    sale   fruit  year  price
0  15000   Apple  2010    3.4
1  17000   Apple  2011    2.4
2  36000  Orange  2012    4.6
3  24000  Orange  2011    3.8
4  29000  Banana  2012    7.3
frame.total = 20
frame
Out[62]: 
    sale   fruit  year  price
0  15000   Apple  2010    3.4
1  17000   Apple  2011    2.4
2  36000  Orange  2012    4.6
3  24000  Orange  2011    3.8
4  29000  Banana  2012    7.3
frame['total'] = 20
frame
Out[64]: 
    sale   fruit  year  price  total
0  15000   Apple  2010    3.4     20
1  17000   Apple  2011    2.4     20
2  36000  Orange  2012    4.6     20
3  24000  Orange  2011    3.8     20
4  29000  Banana  2012    7.3     20

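After the frame.total assignment no total column appears, while after the frame['total'] assignment the total column is there.

If anything above is wrong, please point it out so that it can be corrected. Thanks!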
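
The difference can be checked directly against frame.columns. This is a minimal sketch using a two-row frame made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'sale': [15000, 17000]})

# Attribute assignment to a name that is not a column sets a plain
# Python attribute on the object; no column is created.
frame.total = 20
print('total' in frame.columns)   # False
print(frame.total)                # 20

# Bracket assignment creates the column.
frame['total'] = 20
print('total' in frame.columns)   # True
print(frame['total'].tolist())    # [20, 20]
```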