Python数据分析 - PolarsBook中文版: https://www.pythondataanalysis.com/docs/polars_book_cn/ - Polars快速入门: https://www.pythondataanalysis.com/docs/polars_book_cn/quickstart/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/expressions/ - Polars上下文: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/contexts/ - Polars分组: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/groupby/ - Polars折叠: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/folds/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/custom_functions/ - Polars实例: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/introduction_polars/ - Polars表达式方法: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/api/ - Polars视频介绍: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/video_intro/ - Polars与Numpy交互: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/numpy/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/window_functions/ - Polars索引: https://www.pythondataanalysis.com/docs/polars_book_cn/indexing/ - Polars数据类型: https://www.pythondataanalysis.com/docs/polars_book_cn/datatypes/ - 来自Pandas: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_pandas/ - 来自ApacheSpark: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_spark/ - Polars性能: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/strings/ - Polars优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/ - Polars惰性方法: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/ - 谓词下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/predicate-pushdown/ - 投影下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/projection-pushdown/ - 其它优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/other-optimizations/ - Polars参考指南: https://www.pythondataanalysis.com/docs/polars_book_cn/references/ - Polars时间序列: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/ - Polars时间序列实例: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/time-series/ - Polars使用范围: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/ - IO: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/ - Polars操作CSV文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/csv/ - Polars操作Parquet文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/parquet/ - Polars处理多个文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/multiple_files/ - Polars读取数据库: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/read_db/ - Polars与AWS交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/aws/ - Polars与Google BigQuery交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/google-big-query/ - Polars与Postgres交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/postgres/ - 互通性: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/ - Arrow: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/arrow/ - Numpy: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/numpy/ - 数据: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/strings/ - 时间戳: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/timestamps/ - 数据帧: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/ - 选中行或列: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/row_col_selection/ - 常用操作: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/common-manipulations/ - 聚合: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/aggregate/ - 分组: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/groupby/ - 过滤: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/filter/ - 连接: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/join/ - 重塑: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/melt/ - 条件应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/conditionally-apply/ - 排序: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/sorting/ - 透视: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/pivot/ - 应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/udfs/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/window-functions/ - Python数据分析 第二版: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/ - 第 1 章 准备工作: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-01/ - 第 2 章 Python 语法基础: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-02/ - 第 3 章 Python 的数据结构、函数和文件: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-03/ - 第 4 章 NumPy 基础:数组和向量计算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-04/ - 第 5 章 Pandas 入门: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-05/ - 第 6 章 数据加载、存储与文件格式: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-06/ - 第 7 章 数据清洗和准备: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-07/ - 第 10 章 数据聚合与分组运算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-10/ - 第 11 章 时间序列: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-11/ - 第 12 章 pandas 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-12/ - 第 13 章 Python 建模库介绍: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-13/ - 第 14 章 数据分析案例: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-14/ - 附录 A NumPy 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-A/ - 附录 B 更多关于 IPython 的内容: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-B/ - 第 8 章 数据规整:聚合、合并和重塑: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-08/ - 第 9 章 绘图和可视化: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-09/ - Polars用户指南: https://www.pythondataanalysis.com/docs/Polars_user_guide/ - Polars入门: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_getting_started/ - 安装Polars: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_installation/ - Polars核心概念: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/ - Polars数据类型和结构: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/data-types-and-structures/ - Polars表达式和上下文: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/expressions-and-contexts/ - Polars延迟API: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/lazy-api/ - Streaming: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/_streaming/ - Polars表达式: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/ - Polars基本操作: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/basic-operations/ - Aggregation: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/aggregation/ - Casting: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/casting/ - Categorical Data and Enums: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/categorical-data-and-enums/ - Expression Expansion: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/expression-expansion/ - Folds: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/folds/ - Lists and Arrays: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/lists-and-arrays/ - Missing Data: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/missing-data/ - Numpy Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/numpy-functions/ - Strings: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/strings/ - Structs: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/structs/ - User Defined Python Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/user-defined-python-functions/ - Window Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/window-functions/ - Reference: https://www.pythondataanalysis.com/docs/Polars_user_guide/api/reference/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/contributing/ - Versioning: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/versioning/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars-cloud/ - Ecosystem: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/ecosystem/ - Gpu Support: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/gpu-support/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/io/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/lazy/ - Pandas: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/pandas/ - Spark: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/spark/ - Arrow: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/arrow/ - Comparison: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/comparison/ - Multiprocessing: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/multiprocessing/ - Polars Llms: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/polars_llms/ - Styling: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/styling/ - Visualization: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/visualization/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/plugins/ - Create: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/create/ - Cte: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/cte/ - Intro: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/intro/ - Select: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/select/ - Show: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/show/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/transformations/ # Structs The data type `Struct` is a composite data type that can store multiple fields in a single column. !!! tip "Python analogy" For Python users, the data type `Struct` is kind of like a Python dictionary. Even better, if you are familiar with Python typing, you can think of the data type `Struct` as `typing.TypedDict`. In this page of the user guide we will see situations in which the data type `Struct` arises, we will understand why it does arise, and we will see how to work with `Struct` values. Let's start with a dataframe that captures the average rating of a few movies across some states in the US: {{code_block('user-guide/expressions/structs','ratings_df',['DataFrame'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:ratings_df" ``` ## Encountering the data type `Struct` A common operation that will lead to a `Struct` column is the ever so popular `value_counts` function that is commonly used in exploratory data analysis. Checking the number of times a state appears in the data is done as so: {{code_block('user-guide/expressions/structs','state_value_counts',['value_counts'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:state_value_counts" ``` Quite unexpected an output, especially if coming from tools that do not have such a data type. We're not in peril, though. To get back to a more familiar output, all we need to do is use the function `unnest` on the `Struct` column: {{code_block('user-guide/expressions/structs','struct_unnest',['unnest'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:struct_unnest" ``` The function `unnest` will turn each field of the `Struct` into its own column. !!! note "Why `value_counts` returns a `Struct`" Polars expressions always operate on a single series and return another series. `Struct` is the data type that allows us to provide multiple columns as input to an expression, or to output multiple columns from an expression. Thus, we can use the data type `Struct` to specify each value and its count when we use `value_counts`. ## Inferring the data type `Struct` from dictionaries When building series or dataframes, Polars will convert dictionaries to the data type `Struct`: {{code_block('user-guide/expressions/structs','series_struct',['Series'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:series_struct" ``` The number of fields, their names, and their types, are inferred from the first dictionary seen. Subsequent incongruences can result in `null` values or in errors: {{code_block('user-guide/expressions/structs','series_struct_error',['Series'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:series_struct_error" ``` ## Extracting individual values of a `Struct` Let's say that we needed to obtain just the field `"Movie"` from the `Struct` in the series that we created above. We can use the function `field` to do so: {{code_block('user-guide/expressions/structs','series_struct_extract',['struct.field'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:series_struct_extract" ``` ## Renaming individual fields of a `Struct` What if we need to rename individual fields of a `Struct` column? We use the function `rename_fields`: {{code_block('user-guide/expressions/structs','series_struct_rename',['struct.rename_fields'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:series_struct_rename" ``` To be able to actually see that the field names were change, we will create a dataframe where the only column is the result and then we use the function `unnest` so that each field becomes its own column. The column names will reflect the renaming operation we just did: {{code_block('user-guide/expressions/structs','struct-rename-check',['struct.rename_fields'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:struct-rename-check" ``` ## Practical use-cases of `Struct` columns ### Identifying duplicate rows Let's get back to the `ratings` data. We want to identify cases where there are duplicates at a “Movie” and “Theatre” level. This is where the data type `Struct` shines: {{code_block('user-guide/expressions/structs','struct_duplicates',['is_duplicated', 'struct'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:struct_duplicates" ``` We can identify the unique cases at this level also with `is_unique`! ### Multi-column ranking Suppose, given that we know there are duplicates, we want to choose which rating gets a higher priority. We can say that the column “Count” is the most important, and if there is a tie in the column “Count” then we consider the column “Avg_Rating”. We can then do: {{code_block('user-guide/expressions/structs','struct_ranking',['is_duplicated', 'struct'])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:struct_ranking" ``` That's a pretty complex set of requirements done very elegantly in Polars! To learn more about the function `over`, used above, [see the user guide section on window functions](window-functions.md). ### Using multiple columns in a single expression As mentioned earlier, the data type `Struct` is also useful if you need to pass multiple columns as input to an expression. As an example, suppose we want to compute [the Ackermann function](https://en.wikipedia.org/wiki/Ackermann_function) on two columns of a dataframe. There is no way of composing Polars expressions to compute the Ackermann function[^1], so we define a custom function: {{code_block('user-guide/expressions/structs', 'ack', [])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:ack" ``` Now, to compute the values of the Ackermann function on those arguments, we start by creating a `Struct` with fields `m` and `n` and then use the function `map_elements` to apply the function `ack` to each value: {{code_block('user-guide/expressions/structs','struct-ack',[], ['map_elements'], [])}} ```python exec="on" result="text" session="expressions/structs" --8<-- "python/user-guide/expressions/structs.py:struct-ack" ``` Refer to [this section of the user guide to learn more about applying user-defined Python functions to your data](user-defined-python-functions.md). [^1]: To say that something cannot be done is quite a bold claim. If you prove us wrong, please let us know!