# Polars time series

`Polars` has strong support for resampling time series. Many people know this functionality from `df.resample` in `Pandas`. Unlike `Pandas`, `Polars` distinguishes between two kinds of resampling:

- Upsampling
- Downsampling

## Upsampling

Upsampling amounts to left joining a date range with your dataset and filling in the missing data. `Polars` provides a wrapper method for this operation, as in the example below.

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2021, 12, 16),
            end=datetime(2021, 12, 16, 3),
            interval="30m",
            eager=True,
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
        "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)

# Upsample to 15-minute steps and carry the last known value forward
out1 = df.upsample(time_column="time", every="15m").fill_null(strategy="forward")

# Upsample, interpolate, then forward-fill any remaining nulls
out2 = df.upsample(time_column="time", every="15m").interpolate().fill_null(strategy="forward")
```

## Downsampling

Downsampling is more involved: you have to deal with date intervals, window durations, aggregation, and so on. `Polars` treats downsampling as a special case of the group-by operation, so the expression API provides two extra entry points into the group-by context:
- [groupby_dynamic](POLARS_PY_REF_GUIDE/api/polars.DataFrame.groupby_dynamic.html)
- [groupby_rolling](POLARS_PY_REF_GUIDE/api/polars.DataFrame.groupby_rolling.html)

Calling either of these functions gives you full access to the expression API and its performance.

The examples below show what this means in practice.

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2021, 12, 16),
            end=datetime(2021, 12, 16, 3),
            interval="30m",
            eager=True,
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
    }
)
```

```python
out = df.groupby_dynamic(
    "time",
    every="1h",
    closed="both",
    by="groups",
    include_boundaries=True,
).agg([pl.count()])
```

## Dynamic group-by

In the snippet below we build a `DataFrame` holding a `date range` that covers 2021 at a **daily** (`"1d"`) interval.

We then create dynamic windows that start every **month** (`"1mo"`) and last `1` month. The size of a dynamic window is determined not by a number of rows in the `DataFrame` but by a temporal unit, such as one day (`"1d"`), three weeks (`"3w"`), or five nanoseconds (`"5ns"`).

Values that fall inside a dynamic window are assigned to that window's group, and can then be aggregated with the expression API.

The example below uses **groupby_dynamic** to compute:

- the number of days until the end of the month
- the number of days in the month

```python
# Time axis for 2021: one row per day, in a column named "time"
df = pl.date_range(
    start=datetime(2021, 1, 1),
    end=datetime(2021, 12, 31),
    interval="1d",
    name="time",
    eager=True,
).to_frame()

out = (
    df.groupby_dynamic("time", every="1mo", period="1mo", closed="left")
    .agg(
        [
            # Countdown to the end of the month; keep the first three values per window
            pl.col("time").cumcount().reverse().head(3).alias("day/eom"),
            # Span between the first and last day in the window, plus one
            ((pl.col("time") - pl.col("time").first()).last().dt.days() + 1).alias("days_in_month"),
        ]
    )
    .explode("day/eom")
)
print(out)
```

```text
shape: (36, 3)
┌─────────────────────┬─────────┬───────────────┐
│ time                ┆ day/eom ┆ days_in_month │
│ ---                 ┆ ---     ┆ ---           │
│ datetime[μs]        ┆ u32     ┆ i64           │
╞═════════════════════╪═════════╪═══════════════╡
│ 2021-01-01 00:00:00 ┆ 30      ┆ 31            │
│ 2021-01-01 00:00:00 ┆ 29      ┆ 31            │
│ 2021-01-01 00:00:00 ┆ 28      ┆ 31            │
│ 2021-02-01 00:00:00 ┆ 27      ┆ 28            │
│ …                   ┆ …       ┆ …             │
│ 2021-11-01 00:00:00 ┆ 27      ┆ 30            │
│ 2021-12-01 00:00:00 ┆ 30      ┆ 31            │
│ 2021-12-01 00:00:00 ┆ 29      ┆ 31            │
│ 2021-12-01 00:00:00 ┆ 28      ┆ 31            │
└─────────────────────┴─────────┴───────────────┘
```

A dynamic window is defined by three parameters:

- **every**: the interval at which windows start
- **period**: the duration of each window
- **offset**: shifts the start of the windows

Because _**every**_ does not have to equal _**period**_, groups can be created very flexibly: they may overlap, or leave gaps between them.

Let's start with simple examples 🥱 and consider what windows the following parameter combinations create.

> - every: 1 day -> `"1d"`
> - period: 1 day -> `"1d"`

```text
the windows are adjacent and of equal length
|--|
   |--|
      |--|
```

> - every: 1 day -> `"1d"`
> - period: 2 days -> `"2d"`

```text
the windows overlap by 1 day
|----|
  |----|
    |----|
```

> - every: 2 days -> `"2d"`
> - period: 1 day -> `"1d"`

```text
there are gaps between the windows; data in these gaps belongs to no group
|--|
      |--|
            |--|
```

## Rolling group-by

The rolling group-by is another entry point into the group-by context. Unlike `groupby_dynamic`, its windows are not laid out on a fixed grid defined by `every` and `period`. In a rolling group-by the windows are not fixed at all: they are determined by the values in `index_column`.

Suppose you have a time series with the values `{2021-01-01, 2021-01-05}`. With `period="5d"`, the following windows are created:

```text
2021-01-01   2021-01-06
    |----------|

       2021-01-05   2021-01-10
             |----------|
```

Because the windows of a rolling group-by are always determined by the values in the `DataFrame` column, the number of groups always equals the number of rows in the original `DataFrame`.

## Combining regular group-by with dynamic and rolling group-by

Dynamic and rolling group-bys can be combined with a regular group-by operation.

Below is an example that uses a dynamic group-by:

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2021, 12, 16),
            end=datetime(2021, 12, 16, 3),
            interval="30m",
            eager=True,
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
    }
)
print(df)
```

```text
shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
│ 2021-12-16 00:30:00 ┆ a      │
│ 2021-12-16 01:00:00 ┆ a      │
│ 2021-12-16 01:30:00 ┆ b      │
│ 2021-12-16 02:00:00 ┆ b      │
│ 2021-12-16 02:30:00 ┆ a      │
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘
```

```python
# Dynamic group-by: hourly windows, evaluated separately per value of "groups"
out = df.groupby_dynamic(
    "time",
    every="1h",
    closed="both",
    by="groups",
    include_boundaries=True,
).agg([pl.count()])
print(out)
```

```text
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ count │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1     │
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3     │
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1     │
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2     │
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1     │
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2     │
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1     │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┘
```
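
To make the interplay of `every` and `period` in a dynamic window more concrete, here is a minimal plain-Python sketch of the three parameter combinations discussed in the dynamic group-by section. It is not how `Polars` implements `groupby_dynamic`: the helper names `make_windows` and `assign` are hypothetical, windows are assumed to be left-closed (`[lower, lower + period)`), and `offset` is left out.

```python
from datetime import datetime, timedelta


def make_windows(start, end, every, period):
    """Return (lower, upper) bounds of windows that begin every `every` and last `period`."""
    windows = []
    lower = start
    while lower <= end:
        windows.append((lower, lower + period))
        lower += every
    return windows


def assign(points, windows):
    """Map each window to the points inside it (assumption: left-closed windows)."""
    return {w: [p for p in points if w[0] <= p < w[1]] for w in windows}


day = timedelta(days=1)
start = datetime(2021, 1, 1)
points = [start + i * day for i in range(4)]  # 2021-01-01 .. 2021-01-04

# every == period: adjacent windows, each point lands in exactly one group
adjacent = assign(points, make_windows(start, points[-1], every=day, period=day))
# every < period: overlapping windows, a point can land in several groups
overlapping = assign(points, make_windows(start, points[-1], every=day, period=2 * day))
# every > period: gaps between windows, some points land in no group
gapped = assign(points, make_windows(start, points[-1], every=2 * day, period=day))

for name, groups in [("adjacent", adjacent), ("overlapping", overlapping), ("gapped", gapped)]:
    print(name, [len(members) for members in groups.values()])
```

With these four daily points, the adjacent case yields one point per window, the overlapping case places most points in two windows, and the gapped case drops 2021-01-02 and 2021-01-04 entirely.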