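# Installing Polars

Polars is a library, and installing it is as simple as invoking the package manager of the corresponding programming language.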

``` bash
pip install polars

# Or, for older CPUs without Advanced Vector Extensions 2 (AVX2) support
pip install polars-lts-cpu
```

``` shell
cargo add polars -F lazy

# Or Cargo.toml
[dependencies]
polars = { version = "x", features = ["lazy", ...]}
```

## Big Index

By default, Polars dataframes are limited to 2^32 (about 4.3 billion) rows. Enabling the big index extension raises this limit to 2^64 (about 18 quintillion) rows:

``` bash
pip install polars-u64-idx
```

``` shell
cargo add polars -F bigidx

# Or Cargo.toml
[dependencies]
polars = { version = "x", features = ["bigidx", ...] }
```
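
To make the two limits concrete, the following sketch simply evaluates the powers of two quoted above. It is plain integer arithmetic and does not call Polars; the variable names are illustrative only.

``` python
# Row-count limits implied by the index width (plain arithmetic, no Polars needed).
default_limit = 2**32  # 32-bit index, the default build
bigidx_limit = 2**64   # 64-bit index, the big index build

print(f"{default_limit:,}")  # 4,294,967,296
print(f"{bigidx_limit:,}")   # 18,446,744,073,709,551,616
```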

## Legacy CPU

To install Polars for Python on an old CPU without [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) support, run:

``` bash
pip install polars-lts-cpu
```
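
If you are unsure which package applies, one way to check on x86 Linux is to look at the CPU feature flags in `/proc/cpuinfo`. The sketch below is only an illustration under stated assumptions: it assumes a Linux system, it treats the `avx2` flag as the deciding one (matching the comment in the install snippet at the top of this page), and the helpers `cpu_flags` and `suggest_package` are hypothetical names that are not part of Polars.

``` python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the feature flags from /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def suggest_package(flags: set[str]) -> str:
    """Hypothetical helper: choose a package name based on the assumed avx2 check."""
    return "polars" if "avx2" in flags else "polars-lts-cpu"


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # x86 Linux only; other platforms need a different check
        print(suggest_package(cpu_flags(cpuinfo.read_text())))
```

The parsing is kept separate from the decision so that the flag being tested can be changed in one place if your CPU documentation points to a different requirement.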

## Importing Polars

To use the Polars library, simply import it into your project:

``` python
import polars as pl
```

``` rust
use polars::prelude::*;
```

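## Feature Flags

The commands above install the core of Polars on your system. Depending on your use case, however, you may also need some optional dependencies. They are optional in order to keep the footprint small. The flags differ by programming language. Throughout the user guide, we point out whenever a feature requires an additional dependency.

### Python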

```text
# Example
pip install 'polars[numpy,fsspec]'
```

#### All

| Flag | Description                        |
| --- | ---------------------------------- |
| all | Install all optional dependencies. |

#### GPU

| Flag | Description                 |
| --- | --------------------------- |
| gpu | Run queries on NVIDIA GPUs. |

**Note**

See [GPU support](gpu-support.md) for more detailed instructions and prerequisites.

#### Interoperability

| Flag     | Description                                        |
| -------- | -------------------------------------------------- |
| pandas   | Convert data to and from pandas dataframes/series. |
| numpy    | Convert data to and from NumPy arrays.             |
| pyarrow  | Convert data to and from PyArrow tables/arrays.    |
| pydantic | Convert data from Pydantic models to Polars.       |

#### Excel

| Flag       | Description                                      |
| ---------- | ------------------------------------------------ |
| calamine   | Read from Excel files with the calamine engine.  |
| openpyxl   | Read from Excel files with the openpyxl engine.  |
| xlsx2csv   | Read from Excel files with the xlsx2csv engine.  |
| xlsxwriter | Write to Excel files with the XlsxWriter engine. |
| excel      | Install all supported Excel engines.             |

#### Database

| Flag       | Description                                                                          |
| ---------- | ------------------------------------------------------------------------------------ |
| adbc       | Read from and write to databases with the Arrow Database Connectivity (ADBC) engine. |
| connectorx | Read from databases with the ConnectorX engine.                                      |
| sqlalchemy | Write to databases with the SQLAlchemy engine.                                       |
| database   | Install all supported database engines.                                              |

#### Cloud

| Flag   | Description                                 |
| ------ | ------------------------------------------- |
| fsspec | Read from and write to remote file systems. |

#### Other I/O

| Flag      | Description                          |
| --------- | ------------------------------------ |
| deltalake | Read from and write to Delta tables. |
| iceberg   | Read from Apache Iceberg tables.     |

#### Other

| Flag        | Description                                     |
| ----------- | ----------------------------------------------- |
| async       | Collect LazyFrames asynchronously.              |
| cloudpickle | Serialize user-defined functions.               |
| graph       | Visualize LazyFrames as a graph.                |
| plot        | Plot dataframes through the `plot` namespace.   |
| style       | Style dataframes through the `style` namespace. |
| timezone    | Timezone support; only needed on Windows.       |
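
Each flag pulls in ordinary Python packages, so you can check what is already present before deciding which flags to request. The sketch below uses the standard library's `importlib.util.find_spec` and does not need Polars itself. The flag-to-module mapping in `EXTRA_MODULES` is an assumption covering only a few flags whose module name matches the flag name, and `missing_extras` is a hypothetical helper, not a Polars API.

``` python
from importlib.util import find_spec

# Assumed mapping from a few flags to the top-level module each one provides.
EXTRA_MODULES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "pyarrow": "pyarrow",
    "fsspec": "fsspec",
}


def missing_extras(modules: dict[str, str]) -> list[str]:
    """Return the flags whose backing module cannot be found in this environment."""
    return sorted(flag for flag, module in modules.items() if find_spec(module) is None)


missing = missing_extras(EXTRA_MODULES)
if missing:
    # Build an install command in the same form as the example above.
    print(f"pip install 'polars[{','.join(missing)}]'")
else:
    print("all checked optional dependencies are installed")
```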

### Rust

```toml
# Cargo.toml
[dependencies]
polars = { version = "0.26.1", features = ["lazy", "temporal", "describe", "json", "parquet", "dtype-datetime"] }
```

The opt-in features are:

- Additional data types:
    - `dtype-date`
    - `dtype-datetime`
    - `dtype-time`
    - `dtype-duration`
    - `dtype-i8`
    - `dtype-i16`
    - `dtype-u8`
    - `dtype-u16`
    - `dtype-categorical`
    - `dtype-struct`
- `lazy` - Lazy API:
    - `regex` - Use regexes in column selection.
    - `dot_diagram` - Create dot diagrams from lazy logical plans.
- `sql` - Pass SQL queries to Polars.
- `streaming` - Process datasets larger than RAM.
- `random` - Generate arrays with randomly sampled values.
- `ndarray` - Convert from `DataFrame` to `ndarray`.
- `temporal` - Conversions between [Chrono](https://docs.rs/chrono/) and Polars for temporal data types.
- `timezones` - Activate timezone support.
- `strings` - Extra string utilities for `StringChunked`:
    - `string_pad` - for `pad_start`, `pad_end`, `zfill`.
    - `string_to_integer` - for `parse_int`.
- `object` - Support for generic ChunkedArrays called `ObjectChunked<T>` (generic over `T`).
  These are downcastable from Series through the [Any](https://doc.rust-lang.org/std/any/index.html) trait.
- Performance related:
    - `nightly` - Several nightly only features such as SIMD and specialization.
    - `performant` - more fast paths, slower compile times.
    - `bigidx` - Activate this feature if you expect >> $2^{32}$ rows.
    This allows polars to scale up way beyond that by using `u64` as an index.
    Polars will be a bit slower with this feature activated as many data structures
    are less cache efficient.
    - `cse` - Activate common subplan elimination optimization.
- IO related:
    - `serde` - Support for [serde](https://crates.io/crates/serde) serialization and deserialization.
    Can be used for JSON and more serde supported serialization formats.
    - `serde-lazy` - Support for [serde](https://crates.io/crates/serde) serialization and deserialization.
    Can be used for JSON and more serde supported serialization formats.
    - `parquet` - Read Apache Parquet format.
    - `json` - JSON serialization.
    - `ipc` - Arrow's IPC format serialization.
    - `decompress` - Automatically infer compression of csvs and decompress them.
    Supported compressions:
      - gzip
      - zlib
      - zstd
- Dataframe operations:
    - `dynamic_group_by` - Group by based on a time window instead of predefined keys.
    Also activates rolling window group by operations.
    - `sort_multiple` - Allow sorting a dataframe on multiple columns.
    - `rows` - Create dataframe from rows and extract rows from `dataframes`.
    Also activates `pivot` and `transpose` operations.
    - `join_asof` - Join ASOF, to join on nearest keys instead of exact equality match.
    - `cross_join` - Create the Cartesian product of two dataframes.
    - `semi_anti_join` - SEMI and ANTI joins.
    - `row_hash` - Utility to hash dataframe rows to `UInt64Chunked`.
    - `diagonal_concat` - Diagonal concatenation thereby combining different schemas.
    - `dataframe_arithmetic` - Arithmetic between dataframes and other dataframes or series.
    - `partition_by` - Split into multiple dataframes partitioned by groups.
- Series/expression operations:
    - `is_in` - Check for membership in Series.
    - `zip_with` - Zip two `Series` / `ChunkedArray`s.
    - `round_series` - round underlying float types of series.
    - `repeat_by` - Repeat element in an array a number of times specified by another array.
    - `is_first_distinct` - Check if element is first unique value.
    - `is_last_distinct` - Check if element is last unique value.
    - `checked_arithmetic` - checked arithmetic returning `None` on invalid operations.
    - `dot_product` - Dot/inner product on series and expressions.
    - `concat_str` - Concatenate string data in linear time.
    - `reinterpret` - Utility to reinterpret bits to signed/unsigned.
    - `take_opt_iter` - Take from a series with `Iterator<Item=Option<usize>>`.
    - `mode` - Return the most frequently occurring value(s).
    - `cum_agg` - `cum_sum`, `cum_min`, and `cum_max`, aggregations.
    - `rolling_window` - rolling window functions, like `rolling_mean`.
    - `interpolate` - Interpolate `None` values.
    - `extract_jsonpath` - [Run `jsonpath` queries on `StringChunked`](https://goessner.net/articles/JsonPath/).
    - `list` - List utils:
      - `list_gather` - take sublist by multiple indices.
    - `rank` - Ranking algorithms.
    - `moment` - Kurtosis and skew statistics.
    - `ewma` - Exponential moving average windows.
    - `abs` - Get absolute values of series.
    - `arange` - Range operation on series.
    - `product` - Compute the product of a series.
    - `diff` - `diff` operation.
    - `pct_change` - Compute change percentages.
    - `unique_counts` - Count unique values in expressions.
    - `log` - Logarithms for series.
    - `list_to_struct` - Convert `List` to `Struct` data types.
    - `list_count` - Count elements in lists.
    - `list_eval` - Apply expressions over list elements.
    - `cumulative_eval` - Apply expressions over cumulatively increasing windows.
    - `arg_where` - Get indices where condition holds.
    - `search_sorted` - Find indices where elements should be inserted to maintain order.
    - `offset_by` - Add an offset to dates that take months and leap years into account.
    - `trigonometry` - Trigonometric functions.
    - `sign` - Compute the sign (positive, negative, or zero) of each element of a series.
    - `propagate_nans` - `NaN`-propagating min/max aggregations.
- Dataframe pretty printing:
    - `fmt` - Activate dataframe formatting.