Python数据分析 - PolarsBook中文版: https://www.pythondataanalysis.com/docs/polars_book_cn/ - Polars快速入门: https://www.pythondataanalysis.com/docs/polars_book_cn/quickstart/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/ - Polars表达式: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/expressions/ - Polars上下文: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/contexts/ - Polars分组: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/groupby/ - Polars折叠: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/folds/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/custom_functions/ - Polars实例: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/introduction_polars/ - Polars表达式方法: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/api/ - Polars视频介绍: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/video_intro/ - Polars与Numpy交互: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/numpy/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/dsl/window_functions/ - Polars索引: https://www.pythondataanalysis.com/docs/polars_book_cn/indexing/ - Polars数据类型: https://www.pythondataanalysis.com/docs/polars_book_cn/datatypes/ - 来自Pandas: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_pandas/ - 来自ApacheSpark: https://www.pythondataanalysis.com/docs/polars_book_cn/coming_from_spark/ - Polars性能: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/performance/strings/ - Polars优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/ - Polars惰性方法: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/ - 谓词下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/predicate-pushdown/ - 投影下推: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/projection-pushdown/ - 其它优化: https://www.pythondataanalysis.com/docs/polars_book_cn/optimizations/lazy/other-optimizations/ - Polars参考指南: https://www.pythondataanalysis.com/docs/polars_book_cn/references/ - Polars时间序列: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/ - Polars时间序列实例: https://www.pythondataanalysis.com/docs/polars_book_cn/timeseries/time-series/ - Polars使用范围: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/ - IO: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/ - Polars操作CSV文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/csv/ - Polars操作Parquet文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/parquet/ - Polars处理多个文件: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/multiple_files/ - Polars读取数据库: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/read_db/ - Polars与AWS交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/aws/ - Polars与Google BigQuery交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/google-big-query/ - Polars与Postgres交互: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/io/postgres/ - 互通性: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/ - Arrow: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/arrow/ - Numpy: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/interop/numpy/ - 数据: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/ - 字符串: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/strings/ - 时间戳: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/data/timestamps/ - 数据帧: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/ - 选中行或列: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/row_col_selection/ - 常用操作: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/common-manipulations/ - 聚合: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/aggregate/ - 分组: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/groupby/ - 过滤: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/filter/ - 连接: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/join/ - 重塑: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/melt/ - 条件应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/conditionally-apply/ - 排序: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/sorting/ - 透视: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/df/pivot/ - 应用: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/ - Polars自定义函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/udfs/ - Polars窗口函数: https://www.pythondataanalysis.com/docs/polars_book_cn/howcani/apply/window-functions/ - Python数据分析 第二版: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/ - 第 1 章 准备工作: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-01/ - 第 2 章 Python 语法基础: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-02/ - 第 3 章 Python 的数据结构、函数和文件: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-03/ - 第 4 章 NumPy 基础:数组和向量计算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-04/ - 第 5 章 Pandas 入门: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-05/ - 第 6 章 数据加载、存储与文件格式: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-06/ - 第 7 章 数据清洗和准备: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-07/ - 第 10 章 数据聚合与分组运算: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-10/ - 第 11 章 时间序列: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-11/ - 第 12 章 pandas 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-12/ - 第 13 章 Python 建模库介绍: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-13/ - 第 14 章 数据分析案例: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-14/ - 附录 A NumPy 高级应用: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-A/ - 附录 B 更多关于 IPython 的内容: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Appendix-B/ - 第 8 章 数据规整:聚合、合并和重塑: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-08/ - 第 9 章 绘图和可视化: https://www.pythondataanalysis.com/docs/Python_Data_Analysis_2nd_Editon/Chapter-09/ - Polars用户指南: https://www.pythondataanalysis.com/docs/Polars_user_guide/ - Polars入门: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_getting_started/ - 安装Polars: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars_installation/ - Polars核心概念: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/ - Polars数据类型和结构: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/data-types-and-structures/ - Polars表达式和上下文: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/expressions-and-contexts/ - Polars延迟API: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/lazy-api/ - Streaming: https://www.pythondataanalysis.com/docs/Polars_user_guide/concepts/_streaming/ - Polars表达式: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/ - Polars基本操作: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/basic-operations/ - Aggregation: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/aggregation/ - Casting: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/casting/ - Categorical Data and Enums: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/categorical-data-and-enums/ - Expression Expansion: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/expression-expansion/ - Folds: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/folds/ - Lists and Arrays: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/lists-and-arrays/ - Missing Data: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/missing-data/ - Numpy Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/numpy-functions/ - Strings: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/strings/ - Structs: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/structs/ - User Defined Python Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/user-defined-python-functions/ - Window Functions: https://www.pythondataanalysis.com/docs/Polars_user_guide/expressions/window-functions/ - Reference: https://www.pythondataanalysis.com/docs/Polars_user_guide/api/reference/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/contributing/ - Versioning: https://www.pythondataanalysis.com/docs/Polars_user_guide/development/versioning/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/polars-cloud/ - Ecosystem: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/ecosystem/ - Gpu Support: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/gpu-support/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/io/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/lazy/ - Pandas: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/pandas/ - Spark: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/migration/spark/ - Arrow: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/arrow/ - Comparison: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/comparison/ - Multiprocessing: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/multiprocessing/ - Polars Llms: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/polars_llms/ - Styling: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/styling/ - Visualization: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/misc/visualization/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/plugins/ - Create: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/create/ - Cte: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/cte/ - Intro: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/intro/ - Select: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/select/ - Show: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/sql/show/ - Index: https://www.pythondataanalysis.com/docs/Polars_user_guide/user-guide/transformations/ # 常用操作 与许多其他数据框架库一样,Polars 提供了大量的常用函数来对 Dataframe 进行操作。 熟悉 Dataframes 的用户会发现 Polars 与 `Pandas` 或 `R` 的实现有许多相似之处。 ## 添加列 ```python out = df.with_columns(pl.Series(["p", "q", "r", "s", "t"]).alias("e")) # .alias方法增加一列 print(out) ``` ```text shape: (5, 5) ┌──────┬──────┬──────────┬─────┬─────┐ │ a ┆ b ┆ c ┆ d ┆ e │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ str │ ╞══════╪══════╪══════════╪═════╪═════╡ │ 1 ┆ foo ┆ 0.37454 ┆ a ┆ p │ │ 2 ┆ ham ┆ 0.950714 ┆ b ┆ q │ │ 3 ┆ spam ┆ 0.731994 ┆ c ┆ r │ │ null ┆ egg ┆ 0.598658 ┆ d ┆ s │ │ 5 ┆ null ┆ 0.156019 ┆ e ┆ t │ └──────┴──────┴──────────┴─────┴─────┘ ``` ## 类型转换 这个例子使用的是 Python 数据类型,但我们也可以在 Polars `dtypes` (如 `pl.Float32`、`pl.Float64`)之间进行转换。 ```python out = df.with_columns(pl.col("a").cast(float)) print(out) ``` ```text shape: (5, 4) ┌──────┬──────┬──────────┬─────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ f64 ┆ str │ ╞══════╪══════╪══════════╪═════╡ │ 1.0 ┆ foo ┆ 0.37454 ┆ a │ │ 2.0 ┆ ham ┆ 0.950714 ┆ b │ │ 3.0 ┆ spam ┆ 0.731994 ┆ c │ │ null ┆ egg ┆ 0.598658 ┆ d │ │ 5.0 ┆ null ┆ 0.156019 ┆ e │ └──────┴──────┴──────────┴─────┘ ``` ## 重命名列 ```python import numpy as np import polars as pl df = pl.DataFrame( { "a": [1, 2, 3, None, 5], "b": ["foo", "ham", "spam", "egg", None], "c": np.random.rand(5), "d": ["a", "b", "c", "d", "e"], } ) df.columns = ["banana", "orange", "apple", "grapefruit"] # 重命名列 ``` ```text shape: (5, 4) ┌────────┬────────┬──────────┬────────────┐ │ banana ┆ orange ┆ apple ┆ grapefruit │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str │ ╞════════╪════════╪══════════╪════════════╡ │ 1 ┆ foo ┆ 0.155995 ┆ a │ │ 2 ┆ ham ┆ 0.058084 ┆ b │ │ 3 ┆ spam ┆ 0.866176 ┆ c │ │ null ┆ egg ┆ 0.601115 ┆ d │ │ 5 ┆ null ┆ 0.708073 ┆ e │ └────────┴────────┴──────────┴────────────┘ ``` ## 删除列 ```python # 删除单独的列 out = df.drop("d") # 删除多个列 out = df.drop(["b", "c"]) # 选择所有列但是不包括('b', 'c') out = df.select(pl.all().exclude(["b", "c"])) # 仅选择列"a" out = df.select(pl.col("a")) ``` ```text shape: (5, 1) ┌──────┐ │ a │ │ --- │ │ i64 │ ╞══════╡ │ 1 │ │ 2 │ │ 3 │ │ null │ │ 5 │ └──────┘ ``` ## 删除空值 ```python df.drop_nulls() ``` ```text shape: (3, 4) ┌─────┬──────┬──────────┬─────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str │ ╞═════╪══════╪══════════╪═════╡ │ 1 ┆ foo ┆ 0.37454 ┆ a │ │ 2 ┆ ham ┆ 0.950714 ┆ b │ │ 3 ┆ spam ┆ 0.731994 ┆ c │ └─────┴──────┴──────────┴─────┘ ``` ## 填充缺失值(NA) 策略: - `mean`:平均值 - `backward`:上一值 - `min`:最小值 - `max`:最大值 ```python df.fill_none("forward") ``` ```text shape: (5, 4) ┌─────┬──────┬──────────┬─────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str │ ╞═════╪══════╪══════════╪═════╡ │ 1 ┆ foo ┆ 0.37454 ┆ a │ │ 2 ┆ ham ┆ 0.950714 ┆ b │ │ 3 ┆ spam ┆ 0.731994 ┆ c │ │ 3 ┆ egg ┆ 0.598658 ┆ d │ │ 5 ┆ egg ┆ 0.156019 ┆ e │ └─────┴──────┴──────────┴─────┘ ``` ## 获取所有列 ```python df.columns ``` ```text ['a', 'b', 'c', 'd'] ``` ## 空值计数 ```python df.null_count() ``` ```text shape: (1, 4) ┌─────┬─────┬─────┬─────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 ┆ 0 │ └─────┴─────┴─────┴─────┘ ``` ## 列排序 ```python df.sort("a", reverse=True) ``` ```text shape: (5, 4) ┌──────┬──────┬──────────┬─────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str │ ╞══════╪══════╪══════════╪═════╡ │ null ┆ egg ┆ 0.598658 ┆ d │ │ 5 ┆ null ┆ 0.156019 ┆ e │ │ 3 ┆ spam ┆ 0.731994 ┆ c │ │ 2 ┆ ham ┆ 0.950714 ┆ b │ │ 1 ┆ foo ┆ 0.37454 ┆ a │ └──────┴──────┴──────────┴─────┘ ``` ## 转为 NumPy ```python df.to_numpy() ``` ```text [[1.0 'foo' 0.3745401188473625 'a'] [2.0 'ham' 0.9507143064099162 'b'] [3.0 'spam' 0.7319939418114051 'c'] [nan 'egg' 0.5986584841970366 'd'] [5.0 None 0.15601864044243652 'e']] ``` ## 转为 Pandas ```python df.to_pandas() ``` ```text a b c d 0 1.0 foo 0.374540 a 1 2.0 ham 0.950714 b 2 3.0 spam 0.731994 c 3 NaN egg 0.598658 d 4 5.0 None 0.156019 e ```