Chapter 4: Pandas

木婉清, 2023/12/03

Installing the Pandas library

# Run the following command
pip install pandas

Pandas Series

Overview

A Pandas Series resembles a single column of a table: it is a one-dimensional labeled array that can hold any data type.

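  • A Series consists of an index and a column of values. The constructor is:
pandas.Series(data, index, dtype, name, copy)
Parameter    Description
data         The data: an array-like (such as a list or ndarray), a dict, or a scalar value
index        Index labels for the data; defaults to 0, 1, 2, ... when omitted
dtype        Data type; inferred from the data when omitted
name         Name of the Series
copy         Whether to copy the input data; defaults to False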
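
As a rough illustration of the constructor parameters listed above, the sketch below sets index, dtype and name explicitly. The sample values and the names `temps` and `temperature` are made up for this example.

```python
import pandas as pd

# Hypothetical sample data: three daily temperatures.
temps = pd.Series(
    [21.5, 23.0, 19.8],
    index=["mon", "tue", "wed"],  # row labels instead of the default 0, 1, 2
    dtype="float64",              # force the element type instead of inferring it
    name="temperature",           # name attached to the Series
)

print(temps.index.tolist())
print(temps.dtype)
print(temps.name)
```

Values can then be looked up through the labels, for example `temps["tue"]`.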

Creating a Series from a list

import pandas as pd

s1 = pd.Series([100, 23, "bugingcode"])
print(s1)
Output

0           100
1            23
2    bugingcode
dtype: object

Adding an explicit index to a Series

import pandas as pd
import numpy as np

# np.random.randn is a NumPy function that draws random numbers from the standard normal distribution (mean 0, standard deviation 1).
s1 = pd.Series(np.random.randn(365), index=np.arange(1, 366))
print(s1)
Output

1     -0.943966
2      1.269560
3      0.536271
4     -0.842864
5     -0.524680
         ...
361   -0.994169
362   -0.076934
363   -1.814863
364   -1.778082
365   -0.788316
Length: 365, dtype: float64

import pandas as pd
import numpy as np

s1 = pd.Series([3, 5, 6, 8, 9, 2], index=["a", "b", "c", "d", "e", "f"])
print(s1)
Output

a    3
b    5
c    6
d    8
e    9
f    2
dtype: int64

Creating an empty Series

import pandas as pd

s = pd.Series()
print(s)
Output

Series([], dtype: object)

Creating a Series from an ndarray

import pandas as pd
import numpy as np

data = np.array(["a", "b", "c", "d"])
s = pd.Series(data)
print(s)
Output

0    a
1    b
2    c
3    d
dtype: object

Creating a Series from a dict

import pandas as pd
import numpy as np

data = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(data)
print(s)
Output

a    0.0
b    1.0
c    2.0
dtype: float64

import pandas as pd
import numpy as np

dic = {"m": 4, "n": 5, "p": 6}
ind = ["m", "n", "p", "a"]
s = pd.Series(dic, index=ind)
print(s)
Output

m    4.0
n    5.0
p    6.0
a    NaN
dtype: float64

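Creating a Series from a scalar

Note

If the data is a scalar value, provide an index; without one, the result is a Series of length 1.

The value is repeated to match the length of the index.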

import pandas as pd
import numpy as np

s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Output

0    5
1    5
2    5
3    5
dtype: int64

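Accessing data in a Series by position

import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
# .iloc selects by integer position, regardless of the index labels
print(s.iloc[0])
Output

1

Prefer s.iloc[0] over s[0] for positional access: with a non-integer index such as this one, s[0] falls back to a positional lookup, and recent pandas versions deprecate that fallback.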
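
To make the difference between label-based and position-based access concrete, here is a minimal sketch using a Series whose integer labels deliberately do not match the positions. The data is made up for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
s = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the first row, whatever its label

print(by_label)     # y
print(by_position)  # x
```

Writing `.loc` or `.iloc` explicitly removes any doubt about which of the two lookups is meant.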

Retrieving data by label (index)

import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
print(s["a"])
Output

1

Operations on a Series

import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
print(s[s > 3])
print(s * 2)
Output

d    4
e    5
dtype: int64
a     2
b     4
c     6
d     8
e    10
dtype: int64
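
Arithmetic between two Series is aligned on the index labels rather than on position. The following sketch, with made-up data, shows what happens when the labels only partly overlap, and one way to supply a default through `Series.add` with `fill_value`.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels are matched first; "a" and "d" have no partner, so they become NaN.
total = a + b

# Series.add with fill_value treats a missing partner as 0 instead.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```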

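Pandas DataFrame

Overview

  • A DataFrame is a tabular data structure containing an ordered set of columns, each of which may hold a different value type (numeric, string, boolean, and so on).
  • A DataFrame has both a row index and a column index; it can be viewed as a dict of Series that share a single index.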

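The DataFrame constructor is:

pandas.DataFrame(data, index, columns, dtype, copy)
Parameter    Description
data         The data (ndarray, Series, map, list, dict, and similar types)
index        Index values, also called row labels
columns      Column labels; defaults to RangeIndex (0, 1, 2, …, n)
dtype        Data type
copy         Whether to copy the input data. Older pandas versions default to False; recent versions default to None, which copies dict input but not ndarray or DataFrame input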
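
As a small sketch of the constructor parameters above, the example below passes a 2-D NumPy array together with explicit row labels, column labels and a dtype. The labels `r1`..`r3` and `x`, `y` are arbitrary names chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

df = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],  # row labels
    columns=["x", "y"],        # column labels
    dtype="float64",           # convert every column to float
)

print(df)
print(df.dtypes)
```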

Creating an empty DataFrame

import pandas as pd

df = pd.DataFrame()
print(df)
Output

Empty DataFrame
Columns: []
Index: []

Creating a DataFrame from a list

import pandas as pd

data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
print(df)
Output

   0
0  1
1  2
2  3
3  4
4  5

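Creating a DataFrame from a dict of ndarrays / lists

  • When building from ndarrays, all ndarrays must have the same length. If an index is passed, its length must equal the length of the arrays.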
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve", "Ricky"], "Age": [28, 34, 29, 42]}
df = pd.DataFrame(data)
print(df)
Output

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

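Tips

  1. If columns is specified, the columns are arranged in the given order.
  2. If columns names a column that does not exist in the data, that column is filled with NaN.
  3. The row index (index) and the column index (columns) can be specified at the same time.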
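
The three tips can be checked with a short sketch. The column name `City` and the row labels `r1`, `r2` are hypothetical and exist only to trigger the behaviour described.

```python
import pandas as pd

data = {"Name": ["Tom", "Jack"], "Age": [28, 34]}

df = pd.DataFrame(
    data,
    index=["r1", "r2"],               # row labels
    columns=["Age", "Name", "City"],  # requested order; "City" is not in data
)

print(df)
```

The columns appear in the requested order, and `City`, which has no matching key in `data`, is filled with NaN.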

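Creating a DataFrame from a list of dicts

import pandas as pd

# Each dict is one row; its keys become the column names
data = [
    {"Name": "Tom", "Age": 28},
    {"Name": "Jack", "Age": 34},
    {"Name": "Steve", "Age": 29},
    {"Name": "Ricky", "Age": 42},
]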
df = pd.DataFrame(data)
print(df)
Output

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Note: no index was passed, so the default integer index (0, 1, 2, ...) is used.

Creating a DataFrame from a dict of Series

  • Pass a dict (key/value) whose keys become the column names
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df)
Output

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

Note: the Series for column one has no label d, so the corresponding value is NaN.

Column selection

import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df["one"])
Output

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Row selection

  • Select a row by passing its row label to the loc indexer (loc[])
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df.loc["b"])
Output

one    2.0
two    2.0
Name: b, dtype: float64
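
Besides a single label, rows can also be picked by integer position or as a slice. The sketch below reuses the same dict of Series; it is only meant to show the indexers side by side.

```python
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

row_by_position = df.iloc[2]  # third row, whose label is "c"
first_two = df.iloc[0:2]      # positional slice: end position excluded
label_slice = df.loc["b":"c"]  # label slice: both endpoints included

print(row_by_position)
print(first_two)
print(label_slice)
```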

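Pandas Panel

Note

Panel was deprecated in pandas 0.20 and removed in pandas 0.25; the constructor below applies only to older versions.

pandas.Panel()

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
Parameter     Description
data          Input data: an ndarray, Series, list, dict, or DataFrame
items         axis=0
major_axis    axis=1
minor_axis    axis=2
dtype         Data type of each column
copy          Whether to copy the data; defaults to False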
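
Since Panel is no longer available in current pandas, one common substitute for 3-D data is a DataFrame with a two-level MultiIndex. The sketch below is one possible way to build such a frame with `pd.concat`; the names `item1`, `item2`, `item` and `row` are illustrative and do not come from the Panel API.

```python
import pandas as pd

# Two hypothetical 2-D tables that would have been the "items" of a Panel.
frames = {
    "item1": pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"]),
    "item2": pd.DataFrame({"x": [5, 6], "y": [7, 8]}, index=["r1", "r2"]),
}

# Concatenating a dict of frames uses the dict keys as the outer index level,
# so every row is addressed by the pair (item, row label).
stacked = pd.concat(frames, names=["item", "row"])

print(stacked)
print(stacked.loc["item2"])               # one whole 2-D slice
print(stacked.loc[("item1", "r2"), "y"])  # a single cell
```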

Pandas quick start

Object creation

  • Create a Series by passing a list of values, letting Pandas create a default integer index.
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 4, np.nan, 6, 8])
print(s)
Output

0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

  • Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns
import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(dates)
print(df)
Output

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2018-01-01  1.069910  0.854764  1.193509  0.462927
2018-01-02  1.596742  1.756313 -1.011863  0.934186
2018-01-03 -0.616087 -0.279304  1.126725 -0.910585
2018-01-04 -0.476239  2.689675 -1.765862  0.944808
2018-01-05  0.456227  1.680538  0.480604 -0.951232
2018-01-06 -1.206513  0.985385  0.634503  0.674596
2018-01-07  1.418392  0.259349  0.824261  0.636163

  • Create a DataFrame by passing a dict of objects that can be converted to Series-like structures, as in the following example:
import pandas as pd
import numpy as np

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20181112"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df2)
Output

     A          B    C  D      E    F
0  1.0 2018-11-12  1.0  3   test  foo
1  1.0 2018-11-12  1.0  3  train  foo
2  1.0 2018-11-12  1.0  3   test  foo
3  1.0 2018-11-12  1.0  3  train  foo

Viewing data

Viewing the top / bottom rows

import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(df.head())
print("-------" * 10)
print(df.tail(3))
Output

                   A         B         C         D
2018-01-01  0.890861 -0.095966  0.976528  0.384346
2018-01-02  0.466986 -0.370882  0.022915 -0.531815
2018-01-03 -0.336357 -0.636763 -0.337240  0.753396
2018-01-04 -0.551515 -0.714814 -0.707550  1.429130
2018-01-05 -0.118609  1.763902  1.748892  2.389425
----------------------------------------------------------------------
                   A         B         C         D
2018-01-05 -0.118609  1.763902  1.748892  2.389425
2018-01-06 -2.974731  0.588835 -0.089476 -2.200365
2018-01-07  0.051344 -0.246723 -1.493539 -0.765806

Displaying the index, columns, and underlying NumPy data

import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(f"index is : \n{df.index}")
print(f"columns is : {df.columns}")
print(f"values is : \n{df.values}")
Output

index is : 
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07'],
              dtype='datetime64[ns]', freq='D')
columns is : Index(['A', 'B', 'C', 'D'], dtype='object')
values is : 
[[-0.01048061 -1.43834792  0.54901235 -1.68665799]
 [-1.36649057  0.58630591  0.29951945 -0.15584442]
 [ 1.2318199  -0.29754522 -1.58375215  1.49606666]
 [ 0.16073564 -0.18260582 -0.29160011  0.92872002]
 [-0.49215117 -0.64984877  0.0473542  -0.76075159]
 [-0.54376943 -0.64886347  0.92307105 -0.5876006 ]
 [ 0.20723365 -1.14494873  0.24441887 -0.93413777]]

Displaying a quick statistical summary of the data

import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(df.describe())
Output

              A         B         C         D
count  7.000000  7.000000  7.000000  7.000000
mean  -0.582948 -0.365291  0.400608  0.040816
std    0.873102  0.418553  1.158742  0.837500
min   -2.363027 -1.036569 -1.096581 -1.053369
25%   -0.587280 -0.601736 -0.485705 -0.521208
50%   -0.474676 -0.234526  0.659282 -0.201210
75%   -0.290346 -0.048793  1.124037  0.733014
max    0.512316  0.015117  1.964890  1.116682
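
Each row of the describe() summary corresponds to a statistic that can also be computed on its own. The sketch below uses a small fixed column (made-up values, not the random data above) so that the relationship is easy to verify.

```python
import pandas as pd

# Fixed sample values so the statistics are reproducible.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()
print(summary)

# The individual rows match the corresponding Series methods.
print(summary.loc["mean", "A"], df["A"].mean())
print(summary.loc["50%", "A"], df["A"].median())
print(summary.loc["25%", "A"], df["A"].quantile(0.25))
```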

Transposing the data

import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(df.T)
Output

   2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05  2018-01-06  2018-01-07
A   -0.871924   -0.864058   -0.589633   -1.007141    0.368657    0.656307   -0.269570
B   -1.217285    1.537158   -0.832103    1.534645    0.972108    0.735559   -1.469628
C    1.294199   -1.003429    2.213230   -0.757548    2.172182   -1.719724    0.579202
D    1.082225    1.063930   -1.103883    0.601972   -1.754392    0.827841   -0.876152

Sorting by an axis

import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(df.sort_index(axis=1, ascending=False))
Output

                   D         C         B         A
2018-01-01 -0.561705  0.799202  1.805841 -0.370994
2018-01-02 -1.046135 -0.648496  0.377619  0.536284
2018-01-03 -0.741736 -1.456914  1.141683  1.075007
2018-01-04  0.107061 -0.265725  0.309028 -0.349821
2018-01-05  0.309145  0.994697  0.961910  0.936340
2018-01-06 -0.169231 -0.626054 -0.573925  0.087397
2018-01-07  0.813329  1.120667  0.452052  0.559828

Sorting by values

import pandas as pd
import numpy as np

dates = pd.date_range("20180101", periods=7)
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
print(df.sort_values(by="B"))
Output

                   A         B         C         D
2018-01-04  0.341471 -1.002581  0.722256 -0.081194
2018-01-07 -0.078615 -0.431273 -1.532414  1.393100
2018-01-03 -0.024684 -0.340698 -0.147213  0.150270
2018-01-01  0.025371  0.114132 -0.198946  0.649376
2018-01-05 -1.489747  0.221714  0.973931  1.283776
2018-01-02  0.796899  0.499017  0.296519 -0.356109
2018-01-06 -0.177637  1.185389  0.070776  0.617686
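
sort_values also accepts several sort keys and a per-key direction. The following sketch uses made-up columns `group` and `score` to show one key ascending and the other descending.

```python
import pandas as pd

df = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "score": [3, 1, 4, 2]}
)

# Sort by group (ascending), then by score within each group (descending).
ordered = df.sort_values(by=["group", "score"], ascending=[True, False])

print(ordered)
```

The original row labels travel with the rows, so the index of `ordered` shows where each row came from.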

Selection

Getting

  • Selecting a single column yields a Series; this is equivalent to df.A
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df["A"])
Output

2018-11-16    0.374794
2018-11-17    0.890922
2018-11-18   -1.131266
2018-11-19   -0.262745
2018-11-20    0.867241
2018-11-21   -0.102810
Freq: D, Name: A, dtype: float64

Selection by label

  • Slice rows with the [] operator, either by integer position or by index label
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df[0:3])
print("========= select by date range ========")
print(df["20181116":"20181117"])
Output

                   A         B         C         D
2018-11-16  1.161992 -0.233227 -0.646037 -0.237566
2018-11-17 -1.337365 -1.528861  0.100122 -0.052697
2018-11-18 -0.099463  0.610006  2.079155 -0.230520
========= select by date range ========
                   A         B         C         D
2018-11-16  1.161992 -0.233227 -0.646037 -0.237566
2018-11-17 -1.337365 -1.528861  0.100122 -0.052697

  • Getting a cross section using a label
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.loc[dates[0]])
Output

A    0.586150
B   -0.814260
C   -0.199169
D    2.723732
Name: 2018-11-16 00:00:00, dtype: float64

  • Selecting on multiple axes by label
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.loc[:, ["A", "B"]])
Output

                   A         B
2018-11-16  0.765901 -0.091385
2018-11-17 -0.767895 -0.067514
2018-11-18  1.674256  0.558479
2018-11-19 -1.152138  0.171175
2018-11-20 -0.557931 -1.660133
2018-11-21  2.550183  0.364767

  • Label slicing; both endpoints are included
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.loc["20181116":"20181117", ["A", "B"]])
Output

                   A         B
2018-11-16  0.100581  1.408518
2018-11-17  0.423212  2.228801

  • Reducing the dimensions of the returned object: a single row label returns a Series instead of a DataFrame
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.loc["20181116", ["A", "B"]])
Output

A    0.767234
B   -1.035906
Name: 2018-11-16 00:00:00, dtype: float64

  • Getting a scalar value
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.loc[dates[0], "A"])
Output

-1.7717011036335553

  • Fast access to a scalar with at (same result as loc with a single row and column label)
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.at[dates[0], "A"])
Output

0.019377397464371088
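
Because each snippet above draws fresh random numbers, the loc and at results cannot be compared directly. The sketch below swaps in deterministic values (0 to 23, an assumption made only for this illustration) to show that both accessors return the same cell, and that at can also assign one.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181116", periods=6)
# Deterministic values 0..23 instead of random ones, so results are predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

via_loc = df.loc[dates[0], "A"]  # general label-based indexer
via_at = df.at[dates[0], "A"]    # scalar-only accessor using the same labels
print(via_loc == via_at)

# .at can also set a single cell in place.
df.at[dates[1], "B"] = 100
print(df.loc[dates[1], "B"])
```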

Selection by position

  • Select by the position of the passed integer
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iloc[3])
Output

A    0.003795
B    0.580251
C    0.224530
D   -0.605736
Name: 2018-11-19 00:00:00, dtype: float64

  • By integer slices (the end position is excluded)
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iloc[3:5, 0:2])
Output

                   A         B
2018-11-19  0.234969  0.498381
2018-11-20  0.098810 -0.910606

  • By lists of integer positions
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iloc[[1, 2, 4], [0, 2]])
Output

                   A         C
2018-11-17  1.506590  0.724254
2018-11-18 -0.058624  0.318910
2018-11-20  1.800680  0.032987

  • Slicing rows explicitly
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iloc[1:3, :])
Output

                   A         B         C         D
2018-11-17  2.575051  0.003290  0.052390 -0.635455
2018-11-18  0.697691 -0.827148  1.219686 -0.679870

  • Slicing columns explicitly
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iloc[:, 1:3])
Output

                   B         C
2018-11-16  1.921402  1.256070
2018-11-17  1.002392 -1.530565
2018-11-18  1.251472 -1.691373
2018-11-19 -1.213741  1.082248
2018-11-20 -0.058838  1.306634
2018-11-21 -0.086943 -1.335090

  • Getting a value explicitly
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iloc[1, 1])
Output

0.9102065828423043

  • Fast access to a scalar with iat (same result as iloc with a single row and column position)
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df.iat[1, 1])
Output

-0.5097232015773887
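
The same comparison can be made for the positional accessors. This sketch again assumes deterministic values (0 to 23) in place of random data, purely so the cells are predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181116", periods=6)
# Deterministic values 0..23 instead of random ones.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

via_iloc = df.iloc[1, 1]     # second row, second column
via_iat = df.iat[1, 1]       # scalar-only accessor for the same position
last_cell = df.iloc[-1, -1]  # negative positions count from the end

print(via_iloc, via_iat, last_cell)
```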

Boolean indexing

  • Using the values of a single column to select rows
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df[df.A > 0])
Output

                   A         B         C         D
2018-11-16  0.196343  0.018981 -0.638423  2.387126
2018-11-17  0.093751  0.819993  1.647942 -1.827679
2018-11-19  1.322672  0.376032 -0.299328  1.466714
2018-11-20  0.485520  0.076474 -0.680757  0.084045
2018-11-21  0.280010  0.826607  1.253195 -0.475522

  • Selecting values from a DataFrame where a boolean condition is met; cells that fail the condition become NaN
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df[df > 0])
Output

                   A         B         C         D
2018-11-16  0.145434       NaN  2.240250       NaN
2018-11-17  0.209507  0.050748  0.737113  0.236736
2018-11-18  1.055839  0.117194  1.024828       NaN
2018-11-19  1.306796       NaN  1.770114  0.196997
2018-11-20       NaN       NaN       NaN       NaN
2018-11-21       NaN  1.453985       NaN  1.505579
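
Row conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses small made-up integers so that the selected rows are easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1, 2, 3, -4], "B": [5, -6, 7, 8]})

# Rows where both columns are positive.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]

# Rows where at least one column is negative.
either_negative = df[(df["A"] < 0) | (df["B"] < 0)]

# Element-wise mask: cells that are not positive become NaN.
masked = df[df > 0]

print(both_positive)
print(either_negative)
print(masked)
```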

  • Filtering with the isin() method
import pandas as pd
import numpy as np

dates = pd.date_range("20181116", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
print(df2)
print("=============start to filter ===============")
print(df2[df2["E"].isin(["two", "four"])])
Output

                   A         B         C         D      E
2018-11-16 -0.119814 -0.066240 -0.268556  0.770150    one
2018-11-17 -0.429506  0.299724 -0.219061  1.481897    one
2018-11-18 -0.225174 -1.102308 -1.695428  1.168008    two
2018-11-19  1.447338 -0.082607  0.224933 -0.655493  three
2018-11-20  0.373939  0.903781 -0.466671  0.210134   four
2018-11-21 -1.577789 -0.537294 -0.596788  0.416873  three
=============start to filter ===============
                   A         B         C         D     E
2018-11-18 -0.225174 -1.102308 -1.695428  1.168008   two
2018-11-20  0.373939  0.903781 -0.466671  0.210134  four
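
The mask returned by isin() can be inverted with `~` to keep everything except the listed values. A minimal sketch, using the same labels in column E plus a made-up numeric column `v`:

```python
import pandas as pd

df = pd.DataFrame(
    {"E": ["one", "one", "two", "three", "four", "three"], "v": range(6)}
)

mask = df["E"].isin(["two", "four"])

kept = df[mask]      # rows whose E is "two" or "four"
dropped = df[~mask]  # all remaining rows

print(kept)
print(dropped)
```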

Edited on 2023/12/5 10:59:58