# Getting started with Polars

This chapter helps you get started with Polars. It covers the library's fundamental features and functionality, so that new users can get familiar with the basics, from initial installation and setup to core functionality. If you are already an advanced user or familiar with dataframes, you can skip ahead to the next chapter, [Installing Polars](/docs/Polars_user_guide/polars_installation/).

## Installing Polars

```bash
pip install polars
```

## Reading & writing

Polars supports reading and writing common file formats (e.g., CSV, JSON, Parquet), cloud storage (Amazon S3, Microsoft Azure Blob, Google BigQuery), and databases (e.g., PostgreSQL, MySQL). Below, we create a small dataframe and show how to write it to disk and read it back.

```python
import polars as pl
import datetime as dt

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

# Round trip through a CSV file. The path "output.csv" is only an
# illustrative choice; try_parse_dates restores the date column on read.
df.write_csv("output.csv")
df_csv = pl.read_csv("output.csv", try_parse_dates=True)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```

For more examples on the CSV file format and other data formats, see the [IO](io/index.md) section of the user guide.

## Expressions and contexts

_Expressions_ are one of the main strengths of Polars because they provide a modular and flexible way of expressing data transformations.

Here is an example of a Polars expression:

```py
pl.col("weight") / (pl.col("height") ** 2)
```

As you might guess, this expression takes the column named "weight" and divides its values by the square of the values in the column "height", computing a person's body mass index (BMI). Note that the code above expresses an abstract computation: only inside a Polars context does the expression materialize into a series holding the results.

Below, we show Polars expressions in different contexts:

- `select`
- `with_columns`
- `filter`
- `group_by`

For a more detailed exploration of [expressions and contexts](concepts/expressions-and-contexts.md), see the respective user guide section.

### `select`

The `select` context lets you select and manipulate columns from a dataframe. In the simplest case, each expression you provide maps to one column in the result dataframe:

```py
result = df.select(
    pl.col("name"),
    pl.col("birthdate").dt.year().alias("birth_year"),
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)
print(result)
```

```
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘
```

Polars also supports a feature called "expression expansion", in which one expression acts as shorthand for multiple expressions. In the following example, we use expression expansion to manipulate the columns "weight" and "height" with a single expression. When using expression expansion, you can use `.name.suffix` to add a suffix to the names of the original columns:

```py
result = df.select(
    pl.col("name"),
    (pl.col("weight", "height") * 0.95).round(2).name.suffix("-5%"),
)
print(result)
```

```
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.01     ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 50.92     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘
```

You can check other sections of the user guide to learn more about [basic operations](expressions/basic-operations.md) or [column selection in expression expansion](expressions/expression-expansion.md).

### `with_columns`

The `with_columns` context is very similar to the `select` context, but `with_columns` adds columns to the dataframe instead of selecting them. Notice how the resulting dataframe contains the four columns of the original dataframe plus the two new columns introduced by the expressions inside `with_columns`:

```py
result = df.with_columns(
    birth_year=pl.col("birthdate").dt.year(),
    bmi=pl.col("weight") / (pl.col("height") ** 2),
)
print(result)
```

```
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘
```

In the example above, we also used named expressions instead of the `alias` method to specify the names of the new columns. Other contexts such as `select` and `group_by` also accept named expressions.

### `filter`

The `filter` context lets us create a new dataframe that contains a subset of the rows of the original dataframe:

```python
result = df.filter(pl.col("birthdate").dt.year() < 1990)
print(result)
```

```
shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```

You can also provide multiple predicate expressions as separate arguments, which is more convenient than combining them all with `&`:

```py
result = df.filter(
    pl.col("birthdate").is_between(dt.date(1982, 12, 31), dt.date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)
```

```
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
```

### `group_by`

The `group_by` context can be used to group together the rows of the dataframe that share the same value across one or more expressions. The example below counts how many people were born in each decade:

```python
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    maintain_order=True,
).len()
print(result)
```

```
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 1   │
│ 1980   ┆ 3   │
└────────┴─────┘
```

The keyword argument `maintain_order` forces Polars to present the resulting groups in the same order as they appear in the original dataframe. This slows down the grouping operation, but it is used here to ensure the examples are reproducible.

After using the `group_by` context, we can use `agg` to compute aggregations over the resulting groups:

```python
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    maintain_order=True,
).agg(
    pl.len().alias("sample_size"),
    pl.col("weight").mean().round(2).alias("avg_weight"),
    pl.col("height").max().alias("tallest"),
)
print(result)
```

```
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 1           ┆ 57.9       ┆ 1.56    │
│ 1980   ┆ 3           ┆ 69.73      ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘
```

### More complex queries

Contexts and the expressions within them can be chained to create more complex queries according to your needs. In the example below, we combine some of the contexts seen so far into a more complex query:

```python
result = (
    df.with_columns(
        (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
        pl.col("name").str.split(by=" ").list.first(),
    )
    .select(
        pl.all().exclude("birthdate"),
    )
    .group_by(
        pl.col("decade"),
        maintain_order=True,
    )
    .agg(
        pl.col("name"),
        pl.col("weight", "height").mean().round(2).name.prefix("avg_"),
    )
)
print(result)
```

```
shape: (2, 4)
┌────────┬────────────────────────────┬────────────┬────────────┐
│ decade ┆ name                       ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---                        ┆ ---        ┆ ---        │
│ i32    ┆ list[str]                  ┆ f64        ┆ f64        │
╞════════╪════════════════════════════╪════════════╪════════════╡
│ 1990   ┆ ["Alice"]                  ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ ["Ben", "Chloe", "Daniel"] ┆ 69.73      ┆ 1.72       │
└────────┴────────────────────────────┴────────────┴────────────┘
```

## Combining dataframes

Polars provides a number of tools to combine two dataframes. In this section, we show an example of a join and an example of a concatenation.

### Joining dataframes

Polars provides many different join algorithms. The example below shows how to use a left outer join to combine two dataframes when a column can be used as a unique identifier to establish a correspondence between rows across the dataframes:

```python
df2 = pl.DataFrame(
    {
        "name": ["Ben Brown", "Daniel Donovan", "Alice Archer", "Chloe Cooper"],
        "parent": [True, False, False, False],
        "siblings": [1, 2, 3, 4],
    }
)

print(df.join(df2, on="name", how="left"))
```

```
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────┬──────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ parent ┆ siblings │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ bool   ┆ i64      │
╞════════════════╪════════════╪════════╪════════╪════════╪══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ false  ┆ 3        │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ true   ┆ 1        │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ false  ┆ 4        │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ false  ┆ 2        │
└────────────────┴────────────┴────────┴────────┴────────┴──────────┘
```

The other join algorithms are described in the [joins](transformations/joins.md) section of the user guide.

### Concatenating dataframes

Concatenating dataframes creates a taller or wider dataframe, depending on the method used. Assuming we have a second dataframe with data from other people, we can use vertical concatenation to create a taller dataframe:

```python
df3 = pl.DataFrame(
    {
        "name": ["Ethan Edwards", "Fiona Foster", "Grace Gibson", "Henry Harris"],
        "birthdate": [
            dt.date(1977, 5, 10),
            dt.date(1975, 6, 23),
            dt.date(1973, 7, 22),
            dt.date(1971, 8, 3),
        ],
        "weight": [67.9, 72.5, 57.6, 93.1],  # (kg)
        "height": [1.76, 1.6, 1.66, 1.8],  # (m)
    }
)

print(pl.concat([df, df3], how="vertical"))
```

```
shape: (8, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
│ Ethan Edwards  ┆ 1977-05-10 ┆ 67.9   ┆ 1.76   │
│ Fiona Foster   ┆ 1975-06-23 ┆ 72.5   ┆ 1.6    │
│ Grace Gibson   ┆ 1973-07-22 ┆ 57.6   ┆ 1.66   │
│ Henry Harris   ┆ 1971-08-03 ┆ 93.1   ┆ 1.8    │
└────────────────┴────────────┴────────┴────────┘
```

Polars provides vertical, horizontal, and diagonal concatenation. You can learn more about these in the [concatenations](transformations/concatenation.md) section of the user guide.