# Polars Expressions

The following is an expression:

`pl.col("foo").sort().head(2)`

This expression means:

1. Select the `foo` column
1. Sort the `foo` column
1. Take the first two values of the sorted result

The power of expressions lies in the fact that every expression produces a new expression, so they can be chained together.
You can also pass several expressions into a single `Polars` execution context.

For example, below we use `df.select` to place two expressions in the same execution context:

```python
df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == 1).sum()
])
```

The two expressions here are executed in parallel, which means `Polars` expressions are **embarrassingly parallel** (parallel without any communication between tasks).
Note that there may be additional parallelism inside the execution of each individual expression.

## Expression examples

In this section we explore expressions through examples. First, create a dataset:

```python
import polars as pl
import numpy as np

np.random.seed(12)  # Fix the random seed so the same random numbers are generated on every run

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
```

```text
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘
```

You can do a lot with expressions. They are expressive enough that there are often several different ways to arrive at the same result.
To understand expressions better, let's look at more examples.

### Counting unique values

We can count the number of unique values in a column. Note that here we reach the same result in two different ways. To avoid duplicate column names,
we use the `alias` expression to rename the columns.

The output below is from the Polars version the example was originally run on, where `count` includes null values. Recent releases exclude nulls from `count` (and provide `len` for the old behavior), so the second value may differ there.

```python
out = df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)
```

```text
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘
```

### Various aggregations

We can perform a variety of aggregations. Below are a few examples; many more are available, such as `median`, `mean`, `first`, and so on.

```python
out = df.select(
    [
        pl.sum("random").alias("sum"),  # Sum of the random column, output as "sum"
        pl.min("random").alias("min"),  # Minimum of the random column, output as "min"
        pl.max("random").alias("max"),  # Maximum of the random column, output as "max"
        pl.col("random").max().alias("other_max"),  # Another way to compute the maximum
        pl.std("random").alias("std dev"),  # Standard deviation of the random column
        pl.var("random").alias("variance"),  # Variance of the random column
    ]
)
print(out)
```

```text
shape: (1, 6)
┌──────────┬──────────┬─────────┬───────────┬──────────┬──────────┐
│ sum      ┆ min      ┆ max     ┆ other_max ┆ std dev  ┆ variance │
│ ---      ┆ ---      ┆ ---     ┆ ---       ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64     ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪══════════╪═════════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74005 ┆ 0.74005   ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴─────────┴───────────┴──────────┴──────────┘
```

### Filtering and conditional selection

We can also do more complex things. In the example below, we count all names that end with `am`.

```python
out = df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),  # The str namespace accepts a regular expression
    ]
)
print(out)
```

```text
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘
```

### Binary functions and modification

In the example below, we build an expression from a conditional using the `when -> then -> otherwise` pattern.
The `when` function takes a predicate expression, that is, one that evaluates to a boolean `Series`.
The `then` function takes the expression to use where the predicate is true, and the `otherwise` function takes the expression to use where the predicate is false.

You can pass in any expression, including simple ones such as `pl.col("foo")`, `pl.lit(3)`, `pl.lit("bar")`, and so on.

Finally, we multiply the result by a `sum` expression.

```python
out = df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
print(out)
```

```text
shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.695791 │
│ 0.0      │
│ 2.896465 │
│ 0.0      │
│ 0.160325 │
└──────────┘
```

### Window expressions

A Polars expression can implicitly perform GROUPBY, AGGREGATION, and JOIN operations.
In the example below, we use the `over` function to group by `groups` and sum the `random` column within each group. The same pattern could group by `names` and aggregate the `random` column into a list; the code below includes only the first expression.
Window functions can also be combined with other expressions, which gives an efficient way to compute group statistics.
For more group functions, [see here](POLARS_PY_REF_GUIDE/expression.html#aggregation).

```python
df = df.select(
    [
        pl.col("*"),  # Select all columns
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
    ]
)
print(df)
```

```text
shape: (5, 5)
┌──────┬───────┬──────────┬────────┬────────────────────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ sum[random]/groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---                │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ f64                │
╞══════╪═══════╪══════════╪════════╪════════════════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 0.894213           │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      ┆ 0.894213           │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 0.27789            │
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 0.533739           │
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 0.27789            │
└──────┴───────┴──────────┴────────┴────────────────────┘
```

## Conclusion

The expressions shown here are only the tip of the iceberg. `Polars` provides many more expressions, and they can be combined in many ways.

This page is a brief introduction to expressions, meant to give a first impression of how they are used. The next chapter discusses the contexts in which expressions can be used. Later chapters also cover how to use expressions in different `groupby` contexts while making sure `Polars` can execute the computation in parallel.