# Lists and arrays

Polars has first-class support for two homogeneous container data types: `List` and `Array`. Polars supports many operations with the two data types and their APIs overlap, so this section of the user guide has the objective of clarifying when one data type should be chosen in favour of the other.

## Lists vs arrays

### The data type `List`

The data type `List` is suitable for columns whose values are homogeneous 1D containers of varying lengths.

The dataframe below contains three examples of columns with the data type `List`:

{{code_block('user-guide/expressions/lists', 'list-example', ['List'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:list-example"
```

Note that the data type `List` is different from Python's type `list`, where elements can be of any type.
If you want to store true Python lists in a column, you can do so with the data type `Object`, but your column will not have the list manipulation features that we're about to discuss.

### The data type `Array`

The data type `Array` is suitable for columns whose values are homogeneous containers of an arbitrary dimension with a known and fixed shape.

The dataframe below contains two examples of columns with the data type `Array`.

{{code_block('user-guide/expressions/lists', 'array-example', ['Array'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:array-example"
```

The example above shows how to specify that the columns “bit_flags” and “tic_tac_toe” have the data type `Array`, parametrised by the data type of the elements contained within and by the shape of each array.

In general, Polars does not infer that a column has the data type `Array` for performance reasons, and defaults to the appropriate variant of the data type `List`. In Python, an exception to this rule is when you provide a NumPy array to build a column. In that case, Polars has the guarantee from NumPy that all subarrays have the same shape, so an array of $n + 1$ dimensions will generate a column of $n$-dimensional arrays:

{{code_block('user-guide/expressions/lists', 'numpy-array-inference', ['Array'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:numpy-array-inference"
```

### When to use each

In short, prefer the data type `Array` over `List` because it is more memory efficient and more performant. If you cannot use `Array`, then use `List`:

- when the values within a column do not have a fixed shape; or
- when you need functions that are only available in the list API.

## Working with lists

### The namespace `list`

Polars provides many functions to work with values of the data type `List` and these are grouped inside the namespace `list`.
We will explore this namespace a bit now.

!!! warning "`arr` then, `list` now"

    In previous versions of Polars, the namespace for list operations used to be `arr`. `arr` is now the namespace for the data type `Array`. If you find references to the namespace `arr` on StackOverflow or other sources, note that those sources _may_ be outdated.

The dataframe `weather` defined below contains data from different weather stations across a region. When a weather station is unable to get a result, an error code is recorded instead of the actual temperature at that time.

{{code_block('user-guide/expressions/lists', 'weather', [])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:weather"
```

### Programmatically creating lists

Given the dataframe `weather` defined previously, we will likely need to run some analysis on the temperatures captured by each station. To make this happen, we first need to be able to get the individual temperature measurements. We [can use the namespace `str`](strings.md#the-string-namespace) for this:

{{code_block('user-guide/expressions/lists', 'split', ['str.split'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:split"
```

A natural follow-up would be to explode the list of temperatures so that each measurement is in its own row:

{{code_block('user-guide/expressions/lists', 'explode', ['explode'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:explode"
```

However, in Polars we often do not need to do this to operate on the list elements.

### Operating on lists

Polars provides several standard operations on columns with the `List` data type.
[Similar to what you can do with strings](strings.md#slicing), lists can be sliced with the functions `head`, `tail`, and `slice`:

{{code_block('user-guide/expressions/lists', 'list-slicing', ['Expr.list'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:list-slicing"
```

### Element-wise computation within lists

To identify the stations that report the most errors, we need to:

1. try to convert the measurements into numbers;
2. count the number of non-numeric values (i.e., `null` values) in the list, by row; and
3. rename this output column as “errors” so that we can easily identify the stations.

To perform these steps, we need to cast each measurement within the list values. The function `eval` is the entry point for operations on the elements of the list. Within it, you can use the context `element` to refer to each single element of the list individually, and then you can use any Polars expression on the element:

{{code_block('user-guide/expressions/lists', 'element-wise-casting', ['element'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:element-wise-casting"
```

An alternative would be to use a regular expression to check if a measurement starts with a letter:

{{code_block('user-guide/expressions/lists', 'element-wise-regex', ['element'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:element-wise-regex"
```

If you are unfamiliar with the namespace `str` or the notation `(?i)` in the regex, now is a good time to [look at how to work with strings and regular expressions in Polars](strings.md#check-for-the-existence-of-a-pattern).
### Row-wise computations

The function `eval` gives us access to the list elements and `pl.element` refers to each individual element, but we can also use `pl.all()` to refer to all of the elements of the list.

To show this in action, we will start by creating another dataframe with some more weather data:

{{code_block('user-guide/expressions/lists', 'weather_by_day', [])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:weather_by_day"
```

Now, we will calculate the percentage rank of the temperatures by day, measured across stations. Polars does not provide a function to do this directly, but because expressions are so versatile we can create our own percentage rank expression for the highest temperature. Let's try that:

{{code_block('user-guide/expressions/lists', 'rank_pct', ['element', 'rank'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:rank_pct"
```

## Working with arrays

### Creating an array column

As [we have seen above](#the-data-type-array), Polars usually does not infer the data type `Array` automatically. Unless you create the column out of a NumPy array, you have to specify the data type `Array` when creating a series/dataframe or [cast a column](casting.md) explicitly.

### The namespace `arr`

The data type `Array` was introduced recently and still offers a limited set of features. Even so, the namespace `arr` aggregates several functions that you can use to work with arrays.

!!! warning "`arr` then, `list` now"

    In previous versions of Polars, the namespace for list operations used to be `arr`. `arr` is now the namespace for the data type `Array`. If you find references to the namespace `arr` on StackOverflow or other sources, note that those sources _may_ be outdated.
The API documentation should give you a good overview of the functions in the namespace `arr`, of which we present a couple:

{{code_block('user-guide/expressions/lists', 'array-overview', ['Expr.arr'])}}

```python exec="on" result="text" session="expressions/lists"
--8<-- "python/user-guide/expressions/lists.py:array-overview"
```