Python数据分析 - PolarsBook中文版: https://www.pythondataanalysis.com/docs/polars_book_cn/ - Polars快速入门: https://www.pythondataanalysis.com/docs/polars_book_cn/quickstart/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/expressions/ - Polars上下文: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/contexts/ - Polars分组: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/groupby/ - Polars折叠: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/folds/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/custom_functions/ - Polars实例: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/introduction_polars/ - Polars表达式方法: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/api/ - Polars视频介绍: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/video_intro/ - Polars与Numpy交互: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/numpy/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/window_functions/ - Polars索引: https://www.pythondataanalysis.com/docs/polars_book_cn/indexing/ - Polars数据类型: https://www.pythondataanalysis.com/docs/polars_book_cn/datatypes/ - 来自Pandas: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_pandas/ - 来自ApacheSpark: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_spark/ - Polars性能: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/strings/ - Polars优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/ - Polars惰性方法: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/ - 谓词下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/predicate-pushdown/ - 投影下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/projection-pushdown/ - 其它优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/other-optimizations/ - Polars参考指南: https://www.pythondataanalysis.com/docs/polars_book_cn/references/ - Polars时间序列: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/ - Polars时间序列实例: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/time-series/ - Polars使用范围: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/ - IO: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/ - Polars操作CSV文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/csv/ - Polars操作Parquet文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/parquet/ - Polars处理多个文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/multiple_files/ - Polars读取数据库: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/read_db/ - Polars与AWS交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/aws/ - Polars与Google BigQuery交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/google-big-query/ - Polars与Postgres交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/postgres/ - 互通性: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/ - Arrow: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/arrow/ - Numpy: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/numpy/ - 数据: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/strings/ - 时间戳: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/timestamps/ - 数据帧: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/ - 选中行或列: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/row_col_selection/ - 常用操作: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/common-manipulations/ - 聚合: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/aggregate/ - 分组: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/groupby/ - 过滤: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/filter/ - 连接: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/join/ - 重塑: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/melt/ - 条件应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/conditionally-apply/ - 排序: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/sorting/ - 透视: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/pivot/ - 应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/udfs/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/window-functions/ - Python数据分析 第二版: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/ - 第 1 章 准备工作: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-01/ - 第 2 章 Python 语法基础: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-02/ - 第 3 章 Python 的数据结构、函数和文件: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-03/ - 第 4 章 NumPy 基础:数组和向量计算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-04/ - 第 5 章 Pandas 入门: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-05/ - 第 6 章 数据加载、存储与文件格式: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-06/ - 第 7 章 数据清洗和准备: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-07/ - 第 10 章 数据聚合与分组运算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-10/ - 第 11 章 时间序列: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-11/ - 第 12 章 pandas 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-12/ - 第 13 章 Python 建模库介绍: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-13/ - 第 14 章 数据分析案例: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-14/ - 附录 A NumPy 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-A/ - 附录 B 更多关于 IPython 的内容: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-B/ - 第 8 章 数据规整:聚合、合并和重塑: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-08/ - 第 9 章 绘图和可视化: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-09/ - Polars用户指南: https://www.pythondataanalysis.com/docs/Polars_user_guide/ - Polars入门: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_getting_started/ - 安装Polars: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_installation/ - Polars核心概念: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/ - Polars数据类型和结构: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/data-types-and-structures/ - Polars表达式和上下文: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/expressions-and-contexts/ - Polars延迟API: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/lazy-api/ - Streaming: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/_streaming/ - Polars表达式: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/ - Polars基本操作: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/basic-operations/ - Aggregation: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/aggregation/ - Casting: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/casting/ - Categorical Data and Enums: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/categorical-data-and-enums/ - Expression Expansion: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/expression-expansion/ - Folds: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/folds/ - Lists and Arrays: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/lists-and-arrays/ - Missing Data: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/missing-data/ - Numpy Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/numpy-functions/ - Strings: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/strings/ - Structs: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/structs/ - User Defined Python Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/user-defined-python-functions/ - Window Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/window-functions/ - Reference: https://www.pythondataanalysis.com/docs/Polars_user_guide/api/reference/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/contributing/ - Versioning: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/versioning/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars-cloud/ - Ecosystem: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/ecosystem/ - Gpu Support: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/gpu-support/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/io/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/lazy/ - Pandas: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/pandas/ - Spark: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/spark/ - Arrow: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/arrow/ - Comparison: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/comparison/ - Multiprocessing: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/multiprocessing/ - Polars Llms: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/polars_llms/ - Styling: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/styling/ - Visualization: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/visualization/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/plugins/ - Create: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/create/ - Cte: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/cte/ - Intro: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/intro/ - Select: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/select/ - Show: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/show/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/transformations/ # Polars自定义函数 现在你应该相信,polar表达式是如此的强大和灵活,以至于对自定义python函数的需求比你在其他库中可能需要的要少得多。 尽管如此,你仍然需要有能力将表达式传递给第三方库,或者将你的黑匣子函数应用于polar数据。 为此,我们提供了以下几种表达式: - `map` - `apply` ## map 在操作方式上和最终向用户传递的数据上,`map`和`apply`函数有重要的区别。 `map`函数将表达式所支持的`Series`数据原封不动的传递。 `map`函数在`select`和`groupby`中遵循相同的规则,这将意味着`Series`代表`DataFrame`中的一个列。注意,在`groupby`情况下,该列还没有被分组! `map`函数的用法是将表达式中的`Series`传递给第三方库。下面我们展示了如何使用`map`将一个表达式列传递给神经网络模型。 ```python df.with_column([ pl.col("features").map(lambda s: MyNeuralNetwork.forward(s.to_numpy())).alias("activations") ]) ``` 在`groupby`中,`map`的使用情况很有限。它们只用于性能方面,但很容易导致不正确的结果。让我们来解释一下原因。 ```python df = pl.DataFrame( { "keys": ["a", "a", "b"], "values": [10, 7, 1], } ) out = df.groupby("keys", maintain_order=True).agg( [ pl.col("values").map(lambda s: s.shift()).alias("shift_map"), pl.col("values").shift().alias("shift_expression"), ] ) print(df) shape: (3, 2) ┌──────┬────────┐ │ keys ┆ values │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪════════╡ │ a ┆ 10 │ ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ a ┆ 7 │ ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ │ b ┆ 1 │ └──────┴────────┘ ``` 在上面的片段中,我们按`"keys"`列分组。这意味着我们有以下几个组。 ```c "a" -> [10, 7] "b" -> [1] ``` 如果我们再向右应用一个`shift`操作,我们就会发现。 ```c "a" -> [null, 10] "b" -> [null] ``` 现在,让我们打印一下得到的结果: ```python print(out) shape: (2, 3) ┌──────┬────────────┬──────────────────┐ │ keys ┆ shift_map ┆ shift_expression │ │ --- ┆ --- ┆ --- │ │ str ┆ list[i64] ┆ list[i64] │ ╞══════╪════════════╪══════════════════╡ │ a ┆ [null, 10] ┆ [null, 10] │ ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ b ┆ [7] ┆ [null] │ └──────┴────────────┴──────────────────┘ ``` 😯.. 很明显,我们得到了一个错误答案。`"b"`组甚至从`"a"`组拿到了一个值7😵. 这是一个可怕的错误,因为`map`在我们聚合之前就应用了这个函数!这意味着整个列`[10, 7, 1]`先向右移向到了`[null, 10, 7]`,然后再被聚合。 所以我们的建议是,除非你知道你需要使用`map`并且知道你在做什么,否则永远不要在`groupby`时使用`map`。 ## apply 幸运的是,我们可以用`apply`来解决之前的例子。`apply`可以对该操作的最小的逻辑元素起作用。 这就意味着: - `select`-> 单个元素 - `groupby`-> 单个分组 因此,我们可以用`apply`来解决我们上述的问题: ```python out = df.groupby("keys", maintain_order=True).agg( [ pl.col("values").apply(lambda s: s.shift()).alias("shift_map"), pl.col("values").shift().alias("shift_expression"), ] ) print(out) shape: (2, 3) ┌──────┬────────────┬──────────────────┐ │ keys ┆ shift_map ┆ shift_expression │ │ --- ┆ --- ┆ --- │ │ str ┆ list[i64] ┆ list[i64] │ ╞══════╪════════════╪══════════════════╡ │ a ┆ [null, 10] ┆ [null, 10] │ ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ b ┆ [null] ┆ [null] │ └──────┴────────────┴──────────────────┘ ``` 可以看到,我们得到了正确的结果! 🎉 ## `select`中的`apply` 在`select`中,`apply`表达式将列的元素传递给python函数。 *注意,你现在正在运行Python,这将会很慢。* 让我们通过一些例子来看看会发生什么。我们将继续使用我们在本节开始时定义的`DataFrame`,并展示一个使用`apply`函数的例子和一个使用表达式API实现相同目标的反例。 ### 添加一个计数器 在这个例子中,我们创建了一个全局的 `counter` (计数器),然后在每处理一个元素时将整数 `1` 添加到全局状态中。每个迭代的增量结果将被添加到元素值中。 ```python counter = 0 def add_counter(val: int) -> int: global counter counter += 1 return counter + val out = df.select( [ pl.col("values").apply(add_counter).alias("solution_apply"), (pl.col("values") + pl.arange(1, pl.count() + 1)).alias("solution_expr"), ] ) print(out) shape: (3, 2) ┌────────────────┬───────────────┐ │ solution_apply ┆ solution_expr │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞════════════════╪═══════════════╡ │ 11 ┆ 11 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 9 ┆ 9 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 4 ┆ 4 │ └────────────────┴───────────────┘ ``` ### 合并多列值 如果我们想在一次`apply`函数调用中访问不同列的值,我们可以创建`struct`数据类型。这种数据类型将这些列作为字段收集在`struct`中。因此,如果我们从列`"keys"`和`"values"`中创建一个`struct`,我们会得到以下结构元素。 ```python [ {"keys": "a", "values": 10}, {"keys": "a", "values": 7}, {"keys": "b", "values": 1}, ] ``` 这些将作为`dict`传递给调用的Python函数,因此可以通过`field: str`进行索引。 ```python out = df.select( [ pl.struct(["keys", "values"]).apply(lambda x: len(x["keys"]) + x["values"]).alias("solution_apply"), (pl.col("keys").str.lengths() + pl.col("values")).alias("solution_expr"), ] ) print(out) shape: (3, 2) ┌────────────────┬───────────────┐ │ solution_apply ┆ solution_expr │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞════════════════╪═══════════════╡ │ 11 ┆ 11 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 8 ┆ 8 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ 2 ┆ 2 │ └────────────────┴───────────────┘ ``` ### 返回类型 自定义Python函数对polar而言是黑箱。我们真的不知道你在做什么黑科技,所以我们不得不推断并尽力去理解你的意思。 数据类型是自动推断出来的。我们通过等待第一个非空值来做到这一点。这个值将被用来确定`Series`的类型。 python类型与polars数据类型的映射如下: - `int`->`Int64` - `float`->`Float64` - `bool`->`Boolean` - `str`->`Utf8` - `list[tp]`->`List[tp]`(其中内部类型的推断规则相同) - `dict[str, [tp]]`->`struct` - `Any`->`object`(在任何时候都要防止这种情况) 作为一个用户,我们希望您能了解我们的工作,以便能更好地利用自定义函数。