Python数据分析 - PolarsBook中文版: https://www.pythondataanalysis.com/docs/polars_book_cn/ - Polars快速入门: https://www.pythondataanalysis.com/docs/polars_book_cn/quickstart/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/expressions/ - Polars上下文: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/contexts/ - Polars分组: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/groupby/ - Polars折叠: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/folds/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/custom_functions/ - Polars实例: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/introduction_polars/ - Polars表达式方法: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/api/ - Polars视频介绍: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/video_intro/ - Polars与Numpy交互: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/numpy/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/window_functions/ - Polars索引: https://www.pythondataanalysis.com/docs/polars_book_cn/indexing/ - Polars数据类型: https://www.pythondataanalysis.com/docs/polars_book_cn/datatypes/ - 来自Pandas: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_pandas/ - 来自ApacheSpark: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_spark/ - Polars性能: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/strings/ - Polars优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/ - Polars惰性方法: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/ - 谓词下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/predicate-pushdown/ - 投影下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/projection-pushdown/ - 其它优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/other-optimizations/ - Polars参考指南: https://www.pythondataanalysis.com/docs/polars_book_cn/references/ - Polars时间序列: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/ - Polars时间序列实例: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/time-series/ - Polars使用范围: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/ - IO: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/ - Polars操作CSV文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/csv/ - Polars操作Parquet文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/parquet/ - Polars处理多个文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/multiple_files/ - Polars读取数据库: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/read_db/ - Polars与AWS交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/aws/ - Polars与Google BigQuery交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/google-big-query/ - Polars与Postgres交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/postgres/ - 互通性: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/ - Arrow: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/arrow/ - Numpy: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/numpy/ - 数据: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/strings/ - 时间戳: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/timestamps/ - 数据帧: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/ - 选中行或列: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/row_col_selection/ - 常用操作: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/common-manipulations/ - 聚合: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/aggregate/ - 分组: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/groupby/ - 过滤: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/filter/ - 连接: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/join/ - 重塑: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/melt/ - 条件应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/conditionally-apply/ - 排序: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/sorting/ - 透视: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/pivot/ - 应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/udfs/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/window-functions/ - Python数据分析 第二版: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/ - 第 1 章 准备工作: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-01/ - 第 2 章 Python 语法基础: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-02/ - 第 3 章 Python 的数据结构、函数和文件: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-03/ - 第 4 章 NumPy 基础:数组和向量计算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-04/ - 第 5 章 Pandas 入门: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-05/ - 第 6 章 数据加载、存储与文件格式: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-06/ - 第 7 章 数据清洗和准备: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-07/ - 第 10 章 数据聚合与分组运算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-10/ - 第 11 章 时间序列: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-11/ - 第 12 章 pandas 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-12/ - 第 13 章 Python 建模库介绍: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-13/ - 第 14 章 数据分析案例: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-14/ - 附录 A NumPy 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-A/ - 附录 B 更多关于 IPython 的内容: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-B/ - 第 8 章 数据规整:聚合、合并和重塑: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-08/ - 第 9 章 绘图和可视化: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-09/ - Polars用户指南: https://www.pythondataanalysis.com/docs/Polars_user_guide/ - Polars入门: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_getting_started/ - 安装Polars: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_installation/ - Polars核心概念: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/ - Polars数据类型和结构: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/data-types-and-structures/ - Polars表达式和上下文: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/expressions-and-contexts/ - Polars延迟API: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/lazy-api/ - Streaming: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/_streaming/ - Polars表达式: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/ - Polars基本操作: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/basic-operations/ - Aggregation: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/aggregation/ - Casting: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/casting/ - Categorical Data and Enums: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/categorical-data-and-enums/ - Expression Expansion: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/expression-expansion/ - Folds: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/folds/ - Lists and Arrays: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/lists-and-arrays/ - Missing Data: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/missing-data/ - Numpy Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/numpy-functions/ - Strings: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/strings/ - Structs: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/structs/ - User Defined Python Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/user-defined-python-functions/ - Window Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/window-functions/ - Reference: https://www.pythondataanalysis.com/docs/Polars_user_guide/api/reference/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/contributing/ - Versioning: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/versioning/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars-cloud/ - Ecosystem: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/ecosystem/ - Gpu Support: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/gpu-support/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/io/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/lazy/ - Pandas: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/pandas/ - Spark: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/spark/ - Arrow: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/arrow/ - Comparison: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/comparison/ - Multiprocessing: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/multiprocessing/ - Polars Llms: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/polars_llms/ - Styling: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/styling/ - Visualization: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/visualization/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/plugins/ - Create: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/create/ - Cte: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/cte/ - Intro: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/intro/ - Select: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/select/ - Show: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/show/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/transformations/ # Polars窗口函数 窗口函数是一种强大的表达式。它可以让用户在 `select` 上下文中分组进行类聚。 让我们通过例子看看这是什么意思。首先,我们创建一个数据结构,这个数据包含如下列,分别代表口袋妖怪的一些信息: `['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']` ```python import polars as pl # 然后,让我们加载一些包pokemon信息的csv数据 df = pl.read_csv( "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv" ) ``` ```text shape: (163, 13) ┌─────┬───────────────────────┬─────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐ │ # ┆ Name ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ bool │ ╞═════╪═══════════════════════╪═════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡ │ 1 ┆ Bulbasaur ┆ Grass ┆ Poison ┆ … ┆ 65 ┆ 45 ┆ 1 ┆ false │ │ 2 ┆ Ivysaur ┆ Grass ┆ Poison ┆ … ┆ 80 ┆ 60 ┆ 1 ┆ false │ │ 3 ┆ Venusaur ┆ Grass ┆ Poison ┆ … ┆ 100 ┆ 80 ┆ 1 ┆ false │ │ 3 ┆ VenusaurMega Venusaur ┆ Grass ┆ Poison ┆ … ┆ 120 ┆ 80 ┆ 1 ┆ false │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ 147 ┆ Dratini ┆ Dragon ┆ null ┆ … ┆ 50 ┆ 50 ┆ 1 ┆ false │ │ 148 ┆ Dragonair ┆ Dragon ┆ null ┆ … ┆ 70 ┆ 70 ┆ 1 ┆ false │ │ 149 ┆ Dragonite ┆ Dragon ┆ Flying ┆ … ┆ 100 ┆ 80 ┆ 1 ┆ false │ │ 150 ┆ Mewtwo ┆ Psychic ┆ null ┆ … ┆ 90 ┆ 130 ┆ 1 ┆ true │ └─────┴───────────────────────┴─────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘ ``` ## Groupby 类聚 下面我们看看如何用窗口函数对不同的列分组并且类聚。这样我们可以在一次查询中并行的运行多个分组操作。 类聚的结果会投射会原有的行。因此,窗口函数永远返回一个跟原有 `DataFrame` 一样规格的 `DataFrame`。 注意,我们使用了 `.over("Type 1")` 和 `.over(["Type 1", "Type 2"])`,利用窗口函数我们可以一个 `select` 语境中实现多个分组类聚。 更好的是,计算过的分组会被缓存并且在不同的窗口函数中共享。 ```python out = df.select( [ "Type 1", "Type 2", pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"), pl.col("Defense").mean().over(["Type 1", "Type 2"]).alias("avg_defense_by_type_combination"), pl.col("Attack").mean().alias("avg_attack"), ] ) ``` ```text shape: (163, 5) ┌─────────┬────────┬────────────────────┬─────────────────────────────────┬────────────┐ │ Type 1 ┆ Type 2 ┆ avg_attack_by_type ┆ avg_defense_by_type_combination ┆ avg_attack │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ f64 ┆ f64 ┆ f64 │ ╞═════════╪════════╪════════════════════╪═════════════════════════════════╪════════════╡ │ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │ │ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │ │ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │ │ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │ │ … ┆ … ┆ … ┆ … ┆ … │ │ Dragon ┆ null ┆ 94.0 ┆ 55.0 ┆ 75.349693 │ │ Dragon ┆ null ┆ 94.0 ┆ 55.0 ┆ 75.349693 │ │ Dragon ┆ Flying ┆ 94.0 ┆ 95.0 ┆ 75.349693 │ │ Psychic ┆ null ┆ 53.875 ┆ 51.428571 ┆ 75.349693 │ └─────────┴────────┴────────────────────┴─────────────────────────────────┴────────────┘ ``` ## 分组操作 窗口函数不仅仅可以类聚,还可以用来按照组施加自定义函数。例如,如果你想要在某一组中排序,你可以: `.col("value").sort().over("group")`。 让我们试着过滤一些行: ```python filtered = df.filter(pl.col("Type 2") == "Psychic").select( [ "Name", "Type 1", "Speed", ] ) print(filtered) ``` ```text shape: (7, 3) ┌─────────────────────┬────────┬───────┐ │ Name ┆ Type 1 ┆ Speed │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞═════════════════════╪════════╪═══════╡ │ Slowpoke ┆ Water ┆ 15 │ │ Slowbro ┆ Water ┆ 30 │ │ SlowbroMega Slowbro ┆ Water ┆ 30 │ │ Exeggcute ┆ Grass ┆ 40 │ │ Exeggutor ┆ Grass ┆ 55 │ │ Starmie ┆ Water ┆ 115 │ │ Jynx ┆ Ice ┆ 95 │ └─────────────────────┴────────┴───────┘ ``` 注意到,分组 `Water` 的列 `Type 1` 并不连续,中间有两行 `Grass`。而且,同组中的每一个口袋妖股 被按照 `Speed` 升序排列。不幸的是,这个例子我们希望降序排列,幸运的是,这很简单: ```python out = filtered.with_columns( [ pl.col(["Name", "Speed"]).sort(descending=True).over("Type 1"), ] ) print(out) ``` ```text shape: (7, 3) ┌─────────────────────┬────────┬───────┐ │ Name ┆ Type 1 ┆ Speed │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞═════════════════════╪════════╪═══════╡ │ Starmie ┆ Water ┆ 115 │ │ Slowpoke ┆ Water ┆ 30 │ │ SlowbroMega Slowbro ┆ Water ┆ 30 │ │ Exeggutor ┆ Grass ┆ 55 │ │ Exeggcute ┆ Grass ┆ 40 │ │ Slowbro ┆ Water ┆ 15 │ │ Jynx ┆ Ice ┆ 95 │ └─────────────────────┴────────┴───────┘ ``` `Polars` 会追踪每个组的位置,并把相应的表达式映射到适当的行。这个操作可以在一个 select 环境中完成。 窗口函数的强大之处在于:你通常不需要 `groupby -> explode` 组合,而是把逻辑放入一个表达式中。 这也使得 API 更加简洁: - `groupby` -> 标记类聚的分组,返回一个跟组的个数一致的 `DataFrame` - `over` -> 标记我们希望对这个分组进行计算,但是不会更改原有 `DataFrame` 的形状 ## 窗口表达式的规则 窗口表达式的计算规则如下(假设我们有一个 `pl.Int32` 列): ```python # 分组内类聚且广播 # 输出类型: -> Int32 pl.sum("foo").over("groups") # 组内加和,然后乘以组内的元素 # 输出类型: -> Int32 (pl.col("x").sum() * pl.col("y")).over("groups") # 组内加和,然后乘以组内的元素 # 并且组内类聚成一个列表 # 输出类型: -> List(Int32) (pl.col("x").sum() * pl.col("y")).list().over("groups") # 注意这里需要一个显式的 `list` 调用 # 组内加和,然后乘以组内的元素 # 并且组内类聚成一个列表 # list() 会展开 # 如果组内是有序的,这是最快的操作方法: (pl.col("x").sum() * pl.col("y")).list().over("groups").flatten() ``` ## 展开窗口函数 就像刚刚的例子,如果你的窗口函数返回一个 `list`: `pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1")` 这样可以,但是这样会返回一个类型为 `List` 的列,这可能不是我们想要的,而且会增加内存使用。 这是我们可以采用 `flatten`。这个函数会把一个 2D 列表转换成 1D,然后把列投射到我们的 `DataFrame`。 这个操作非常快,因为 reshape 基本没有成本,给原有 `DataFrame` 增加列也非常快,因为我们不需要 一般窗口函数的联合(Join)操作。 但是,想要正确的使用这个操作,我们要保证用于 `over` 的列是有序的。