# Categorical data and enums
A column that holds string values that can only take on one of a limited number of possible values
is a column that holds [categorical data](https://en.wikipedia.org/wiki/Categorical_variable).
Usually, the number of possible values is much smaller than the length of the column. Some typical
examples include your nationality, the operating system of your computer, or the license that your
favorite open source project uses.
When working with categorical data you can use Polars' dedicated types, `Categorical` and `Enum`, to
make your queries more performant. This section explains the differences between the two data
types and when you should use one or the other. It also includes some notes on
[why the data types `Categorical` and `Enum` are more efficient than using the plain string values](#performance-considerations-on-categorical-data-types)
at the end.
## `Enum` vs `Categorical`
In short, you should prefer `Enum` over `Categorical` whenever possible. When the categories are
fixed and known up front, use `Enum`. When you don't know the categories or they are not fixed then
you must use `Categorical`. In case your requirements change along the way you can always cast from
one to the other.
## Data type `Enum`
### Creating an `Enum`
The data type `Enum` is an ordered categorical data type. To use the data type `Enum` you have to
specify the categories in advance to create a new data type that is a variant of an `Enum`. Then,
when creating a new series, a new dataframe, or when casting a string column, you can use that
`Enum` variant.
{{code_block('user-guide/expressions/categoricals', 'enum-example', ['Enum'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:enum-example"
```
### Invalid values
Polars will raise an error if you try to specify a data type `Enum` whose categories do not include
all the values present:
{{code_block('user-guide/expressions/categoricals', 'enum-wrong-value', ['Enum'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:enum-wrong-value"
```
If you are in a position where you cannot know all of the possible values in advance and erroring on
unknown values is semantically wrong, you may need to
[use the data type `Categorical`](#data-type-categorical).
### Category ordering and comparison
The data type `Enum` is ordered and the order is induced by the order in which you specify the
categories. The example below uses log levels to show where an ordered `Enum` is useful:
{{code_block('user-guide/expressions/categoricals', 'log-levels', ['Enum'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:log-levels"
```
This example shows that we can compare `Enum` values with a string, but this only works if the
string matches one of the `Enum` values. If we compared the column `level` with any string other
than `"debug"`, `"info"`, `"warning"`, or `"error"`, Polars would raise an exception.
Columns with the data type `Enum` can also be compared with other columns that have the same data
type `Enum` or columns that hold strings, but only if all the strings are valid `Enum` values.
## Data type `Categorical`
The data type `Categorical` can be seen as a more flexible version of `Enum`.
### Creating a `Categorical` series
To use the data type `Categorical`, you can cast a column of strings or specify `Categorical` as the
data type of a series or dataframe column:
{{code_block('user-guide/expressions/categoricals', 'categorical-example', ['Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:categorical-example"
```
Having Polars infer the categories for you may sound strictly better than listing the categories
beforehand, but this inference comes with a performance cost. That is why, whenever possible, you
should use `Enum`. You can learn more by
[reading the subsection about the data type `Categorical` and its encodings](#data-type-categorical-and-encodings).
### Lexical comparison with strings
When comparing a `Categorical` column with a string, Polars will perform a lexical comparison:
{{code_block('user-guide/expressions/categoricals', 'categorical-comparison-string',
['Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:categorical-comparison-string"
```
You can also compare a column of strings with your `Categorical` column, and the comparison will
also be lexical:
{{code_block('user-guide/expressions/categoricals', 'categorical-comparison-string-column',
['Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:categorical-comparison-string-column"
```
Although it is possible to compare a string column with a categorical column, it is typically more
efficient to compare two categorical columns. We will see how to do that next.
### Comparing `Categorical` columns and the string cache
You are told that comparing columns with the data type `Categorical` is more efficient than if one
of them is a string column. So, you change your code so that the second column is also a categorical
column and then you perform your comparison... But Polars raises an exception:
{{code_block('user-guide/expressions/categoricals', 'categorical-comparison-categorical-column',
['Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:categorical-comparison-categorical-column"
```
By default, the values in columns with the data type `Categorical` are
[encoded in the order they are seen in the column](#encodings), and independently from other
columns, which means that Polars cannot efficiently compare two categorical columns that were
created independently.
Enabling the Polars string cache and creating the columns with the cache enabled fixes this issue:
{{code_block('user-guide/expressions/categoricals', 'stringcache-categorical-equality',
['StringCache', 'Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:stringcache-categorical-equality"
```
Note that using [the string cache comes at a performance cost](#using-the-global-string-cache).
### Combining `Categorical` columns
The string cache is also useful in any operation that combines or mixes two columns with the data
type `Categorical` in any way. An example of this is when
[concatenating two dataframes vertically](../getting-started.md#concatenating-dataframes):
{{code_block('user-guide/expressions/categoricals', 'concatenating-categoricals', ['StringCache',
'Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:concatenating-categoricals"
```
In this case, Polars issues a warning about an expensive re-encoding that implies taking
a performance hit. Polars then suggests using the data type `Enum` if possible, or using the string
cache. To understand the issue with this operation and why Polars issues the warning, please read the
final section about
[the performance considerations of using categorical data types](#performance-considerations-on-categorical-data-types).
### Comparison between `Categorical` columns is not lexical
When comparing two columns with data type `Categorical`, Polars does not perform lexical comparison
between the values by default. If you want lexical ordering, you need to specify so when creating
the column:
{{code_block('user-guide/expressions/categoricals', 'stringcache-categorical-comparison-lexical',
['StringCache', 'Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:stringcache-categorical-comparison-lexical"
```
Otherwise, the order is inferred together with the values:
{{code_block('user-guide/expressions/categoricals', 'stringcache-categorical-comparison-physical',
['StringCache', 'Categorical'])}}
```python exec="on" result="text" session="expressions/categoricals"
--8<-- "python/user-guide/expressions/categoricals.py:stringcache-categorical-comparison-physical"
```
## Performance considerations on categorical data types
This part of the user guide explains:
- why categorical data types are more performant than plain strings; and
- why Polars needs a string cache when doing some operations with the data type `Categorical`.
### Encodings
Categorical data represents string data where the values in the column have a finite set of values
(usually way smaller than the length of the column). Storing these values as plain strings is a
waste of memory and performance as we will be repeating the same string over and over again.
Additionally, in operations like joins we have to perform expensive string comparisons.
Categorical data types like `Enum` and `Categorical` let you encode the string values in a cheaper
way, establishing a relationship between a cheap encoding value and the original string literal.
As an example of a sensible encoding, Polars could choose to represent the finite set of categories
as non-negative integers. With that in mind, the diagram below shows a regular string column and a
possible representation of a Polars column with the categorical data type:
| String column (Series) | Categorical column (Physical) |
| --- | --- |
| Polar | 0 |
| Panda | 1 |
| Brown | 2 |
| Panda | 1 |
| Brown | 2 |
| Brown | 2 |
| Polar | 0 |

| Physical | Categories |
| --- | --- |
| 0 | Polar |
| 1 | Panda |
| 2 | Brown |
The physical `0` in this case encodes (or maps) to the value 'Polar', the value `1` encodes to
'Panda', and the value `2` to 'Brown'. This encoding has the benefit of only storing the string
values once. Additionally, when we perform operations (e.g. sorting, counting) we can work directly
on the physical representation, which is much faster than working with string data.
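The encoding idea can be sketched in plain Python. This is a conceptual illustration only, not how Polars implements it internally, and the function name `dictionary_encode` is hypothetical.

```python
def dictionary_encode(values):
    """Return (codes, categories), assigning codes in order of first appearance."""
    categories = []  # each distinct string is stored exactly once
    code_of = {}     # string -> integer code
    codes = []       # one cheap integer per row
    for value in values:
        if value not in code_of:
            code_of[value] = len(categories)
            categories.append(value)
        codes.append(code_of[value])
    return codes, categories


column = ["Polar", "Panda", "Brown", "Panda", "Brown", "Brown", "Polar"]
codes, categories = dictionary_encode(column)

# Decoding maps every integer back to its string.
decoded = [categories[code] for code in codes]

print(codes)       # [0, 1, 2, 1, 2, 2, 0]
print(categories)  # ['Polar', 'Panda', 'Brown']
```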
### Encodings for the data type `Enum` are global
When working with the data type `Enum` we specify the categories in advance. This way, Polars can
ensure different columns and even different datasets have the same encoding and there is no need for
expensive re-encoding or cache lookups.
### Data type `Categorical` and encodings
The fact that the categories for the data type `Categorical` are inferred comes at a cost. The main
cost here is that we have no control over our encodings.
Consider the following scenario where we append the following two categorical series:
{{code_block('user-guide/concepts/data-types/categoricals','append',[])}}
Polars encodes the string values in the order they appear. So, the series would look like this:
| Physical | `cat_series` categories | `cat2_series` categories |
| --- | --- | --- |
| 0 | Polar | Panda |
| 1 | Panda | Brown |
| 2 | Brown | Polar |
Combining the series becomes a non-trivial and expensive task because the physical value `0`
represents something different in each series. Polars does support these types of operations for
convenience, but they should be avoided because they are slower: Polars must first make
both encodings compatible before doing any merge operation.
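The extra work can be sketched in plain Python. This is a conceptual illustration, not the Polars implementation. The function name `merge_encoded` is hypothetical, and the codes are assumed values consistent with the category orders shown above.

```python
def merge_encoded(codes_a, categories_a, codes_b, categories_b):
    """Append series B to series A after translating B's codes into A's encoding."""
    merged_categories = list(categories_a)
    code_of = {cat: code for code, cat in enumerate(merged_categories)}

    # Build a translation table from B's codes to the merged encoding.
    remap = []
    for cat in categories_b:
        if cat not in code_of:
            code_of[cat] = len(merged_categories)
            merged_categories.append(cat)
        remap.append(code_of[cat])

    # Every row of B has to be rewritten: this is the expensive re-encoding.
    merged_codes = list(codes_a) + [remap[code] for code in codes_b]
    return merged_codes, merged_categories


# In cat_series, 0 means 'Polar'; in cat2_series, 0 means 'Panda'.
merged_codes, merged_categories = merge_encoded(
    [0, 1, 2, 2, 0], ["Polar", "Panda", "Brown"],
    [0, 1, 1, 2, 2], ["Panda", "Brown", "Polar"],
)

print(merged_codes)  # [0, 1, 2, 2, 0, 1, 2, 2, 0, 0]
```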
### Using the global string cache
One way to handle this re-encoding problem is to enable the string cache. Under the string cache, the
diagram would instead look like this:
| Physical | String cache categories |
| --- | --- |
| 0 | Polar |
| 1 | Panda |
| 2 | Brown |
When you enable the string cache, strings are no longer encoded in the order they appear on a
per-column basis. Instead, the encoding is shared across columns. The value 'Polar' will always be
encoded by the same value for all categorical columns created under the string cache. Merge
operations (e.g. appends, joins) become cheap again as there is no need to make the encodings
compatible first, solving the problem we had above.
However, the string cache does come at a small performance hit during construction of the series as
we need to look up or insert the string values in the cache. Therefore, it is preferred to use the
data type `Enum` if you know your categories in advance.
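The trade-off can be sketched in plain Python with a shared lookup table. This is a conceptual illustration only. The class `StringCacheSketch` is hypothetical and is not the Polars string cache.

```python
class StringCacheSketch:
    """One string-to-code mapping shared by every series encoded through it."""

    def __init__(self):
        self._code_of = {}

    def encode(self, values):
        # Each value costs one lookup, plus an insert the first time it is seen.
        return [self._code_of.setdefault(v, len(self._code_of)) for v in values]


cache = StringCacheSketch()
codes_a = cache.encode(["Polar", "Panda", "Brown", "Brown", "Polar"])
codes_b = cache.encode(["Panda", "Brown", "Brown", "Polar", "Polar"])

# 'Polar' is 0 in both series, so appending is plain concatenation of codes.
appended = codes_a + codes_b

print(codes_a)  # [0, 1, 2, 2, 0]
print(codes_b)  # [1, 2, 2, 0, 0]
```

The lookup in `encode` is the construction-time cost mentioned above. With an `Enum`, the mapping is fixed by the declared categories, so no shared mutable table is needed.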