# Polars expressions and contexts

Polars has developed its own domain-specific language (DSL) for transforming data. The language is easy to use and supports complex queries that remain readable. The expressions and contexts introduced here are central to that readability, and they also let the Polars query engine optimize your queries so that they run as fast as possible.

## Polars expressions

In Polars, an expression is a lazy representation of a data transformation. Expressions are modular and flexible, which means you can use them as building blocks for more complex expressions. Here is an example of a Polars expression:

```python
import polars as pl

pl.col("weight") / (pl.col("height") ** 2)
```

As you might guess, this expression takes the column named "weight" and divides its values by the square of the values in the column "height", computing a person's body mass index (BMI).

The code above describes an abstract computation that we can save in a variable, manipulate further, or simply print:

```python
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
```

```
[(col("weight")) / (col("height").pow([dyn int: 2]))]
```

Because expressions are lazy, no computation has taken place yet. That is what contexts are for.

## Polars contexts

A Polars expression must be executed within a context to produce a result. Depending on the context in which it is used, the same Polars expression can produce different results. In this section we look at the four most common contexts that Polars provides:

1. `select`
2. `with_columns`
3. `filter`
4. `group_by`

We use the dataframe below to show how each context works.

```python
from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```

### `select`

The `select` context applies expressions over columns. It may produce new columns that are aggregations, combinations of other columns, or literals:

```python
result = df.select(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
)
print(result)
```

```
shape: (4, 3)
┌───────────┬───────────┬───────────────┐
│ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---       ┆ ---       ┆ ---           │
│ f64       ┆ f64       ┆ i32           │
╞═══════════╪═══════════╪═══════════════╡
│ 23.791913 ┆ 23.438973 ┆ 25            │
│ 23.141498 ┆ 23.438973 ┆ 25            │
│ 19.687787 ┆ 23.438973 ┆ 25            │
│ 27.134694 ┆ 23.438973 ┆ 25            │
└───────────┴───────────┴───────────────┘
```

The expressions in a `select` context must produce series that all have the same length, or they must produce a scalar. Scalars are broadcast to match the length of the remaining series. Literals, like the number used above, are broadcast as well.

Note that broadcasting can also occur within expressions. For instance, consider the expression below:

```python
result = df.select(deviation=(bmi_expr - bmi_expr.mean()) / bmi_expr.std())
print(result)
```

```
shape: (4, 1)
┌───────────┐
│ deviation │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.115645  │
│ -0.097471 │
│ -1.22912  │
│ 1.210946  │
└───────────┘
```

Both the subtraction and the division use broadcasting within the expression, because the subexpressions that compute the mean and the standard deviation each evaluate to a single value.

The `select` context is flexible and powerful: it lets you evaluate arbitrary expressions independently of, and in parallel with, each other. The same is true of the other contexts described next.

### `with_columns`

The `with_columns` context is very similar to the `select` context. The main difference is that `with_columns` creates a new dataframe that contains the columns of the original dataframe plus the new columns produced by its input expressions, whereas `select` includes only the columns selected by its input expressions.

```python
result = df.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
)
print(result)
```

```
shape: (4, 7)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘
```

Because of this difference between `select` and `with_columns`, the expressions used in a `with_columns` context must produce series that have the same length as the original columns of the dataframe, whereas the expressions in a `select` context only need to produce series that have the same length as each other.

### `filter`

The `filter` context filters the rows of a dataframe based on one or more expressions that evaluate to the Boolean data type. When several expressions are given, a row is kept only if all of them are true for it.

```python
result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)
```

```
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
```

### `group_by` and aggregations

In the `group_by` context, the rows of a dataframe are grouped according to the unique values of the grouping expressions. You can then apply expressions to the resulting groups, which may have different lengths.

When using the `group_by` context, you can use an expression to compute the groupings dynamically:

```python
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(pl.col("name"))
print(result)
```

```
shape: (2, 2)
┌────────┬─────────────────────────────────┐
│ decade ┆ name                            │
│ ---    ┆ ---                             │
│ i32    ┆ list[str]                       │
╞════════╪═════════════════════════════════╡
│ 1980   ┆ ["Ben Brown", "Chloe Cooper", … │
│ 1990   ┆ ["Alice Archer"]                │
└────────┴─────────────────────────────────┘
```

After `group_by`, we use `agg` to apply aggregating expressions to the groups. Because the example above only names a column, we get the groups of that column back as lists.

We can specify as many grouping expressions as we like, and the `group_by` context groups the rows according to the distinct combinations of values across those expressions. Here, we group by the decade of birth combined with whether the person is shorter than 1.7 metres:

```python
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    (pl.col("height") < 1.7).alias("short?"),
).agg(pl.col("name"))
print(result)
```

```
shape: (3, 3)
┌────────┬────────┬─────────────────────────────────┐
│ decade ┆ short? ┆ name                            │
│ ---    ┆ ---    ┆ ---                             │
│ i32    ┆ bool   ┆ list[str]                       │
╞════════╪════════╪═════════════════════════════════╡
│ 1980   ┆ true   ┆ ["Chloe Cooper"]                │
│ 1980   ┆ false  ┆ ["Ben Brown", "Daniel Donovan"… │
│ 1990   ┆ true   ┆ ["Alice Archer"]                │
└────────┴────────┴─────────────────────────────────┘
```

The dataframe that results from applying aggregating expressions contains one column per grouping expression on the left, followed by as many columns as are needed to hold the results of the aggregating expressions. In turn, we can specify as many aggregating expressions as we want:

```python
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    (pl.col("height") < 1.7).alias("short?"),
).agg(
    pl.len(),
    pl.col("height").max().alias("tallest"),
    pl.col("weight", "height").mean().name.prefix("avg_"),
)
print(result)
```

```
shape: (3, 6)
┌────────┬────────┬─────┬─────────┬────────────┬────────────┐
│ decade ┆ short? ┆ len ┆ tallest ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---    ┆ --- ┆ ---     ┆ ---        ┆ ---        │
│ i32    ┆ bool   ┆ u32 ┆ f64     ┆ f64        ┆ f64        │
╞════════╪════════╪═════╪═════════╪════════════╪════════════╡
│ 1990   ┆ true   ┆ 1   ┆ 1.56    ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ false  ┆ 2   ┆ 1.77    ┆ 77.8       ┆ 1.76       │
│ 1980   ┆ true   ┆ 1   ┆ 1.65    ┆ 53.6       ┆ 1.65       │
└────────┴────────┴─────┴─────────┴────────────┴────────────┘
```

See also `group_by_dynamic` and `rolling` for other grouping contexts.

## Expression expansion

The last example contained two grouping expressions and three aggregating expressions, yet the resulting dataframe has six columns rather than five. On closer inspection, the last aggregating expression mentions two different columns: "weight" and "height".

Polars expressions support a feature called expression expansion. Expression expansion is a shorthand notation for applying the same transformation to multiple columns. As we have seen, the expression

```python
pl.col("weight", "height").mean().name.prefix("avg_")
```

computes the mean of the columns "weight" and "height" and renames them to "avg_weight" and "avg_height", respectively. In fact, the expression above is equivalent to using the following two expressions:

```python
[
    pl.col("weight").mean().alias("avg_weight"),
    pl.col("height").mean().alias("avg_height"),
]
```

In this case the expression expands into two independent expressions that Polars can execute in parallel. In other cases we may not know in advance how many independent expressions an expression will expand into.

Consider this simple but illustrative example:

```python
(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
```

This expression multiplies every column of data type `Float64` by `1.1`. The number of columns it applies to depends on the schema of each dataframe. For the dataframe we have been using, it applies to two columns:

```python
expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
result = df.select(expr)
print(result)
```

```
shape: (4, 2)
┌────────────┬────────────┐
│ weight*1.1 ┆ height*1.1 │
│ ---        ┆ ---        │
│ f64        ┆ f64        │
╞════════════╪════════════╡
│ 63.69      ┆ 1.716      │
│ 79.75      ┆ 1.947      │
│ 58.96      ┆ 1.815      │
│ 91.41      ┆ 1.925      │
└────────────┴────────────┘
```

For the dataframe `df2` below, the same expression expands to zero columns, because no column has the data type `Float64`:

```python
df2 = pl.DataFrame(
    {
        "ints": [1, 2, 3, 4],
        "letters": ["A", "B", "C", "D"],
    }
)
result = df2.select(expr)
print(result)
```

```
shape: (0, 0)
┌┐
╞╡
└┘
```

It is just as easy to imagine a scenario in which the same expression expands to dozens of columns.

Next, you will learn about [the lazy API and the `explain` function](lazy-api.md#previewing-the-query-plan), which you can use to preview what an expression will expand to for a given dataframe schema.

## Conclusion

Because expressions are lazy, when you use an expression inside a context Polars can try to simplify it before running the data transformation it describes. The separate expressions within a context can easily be executed in parallel, and Polars takes advantage of that; it also parallelizes execution when expression expansion is used. Further performance gains are available when using the [lazy API](lazy-api.md) of Polars, which is introduced next.

We have only scratched the surface of what expressions can do. There are many more expressions, and they can be combined in a variety of ways. For a deeper look at the different kinds of expressions available, see the [section on expressions](../expressions/index.md).