# Missing data

This section of the user guide teaches how to work with missing data in Polars.

## `null` and `NaN` values

In Polars, missing data is represented by the value `null`. This missing value `null` is used for
all data types, including numerical types.

Polars also supports the value `NaN` (“Not a Number”) for columns with floating point numbers. The
value `NaN` is considered to be a valid floating point value, which is different from missing data.
[We discuss the value `NaN` separately below](#not-a-number-or-nan-values).

When creating a series or a dataframe, you can set a value to `null` by using the appropriate
construct for your language:

{{code_block('user-guide/expressions/missing-data','dataframe',['DataFrame'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:dataframe"
```

!!! info "Difference from pandas"

    In pandas, the value used to represent missing data depends on the data type of the column.
    In Polars, missing data is always represented by the value `null`.

## Missing data metadata

Polars keeps track of some metadata regarding the missing data of each series. This metadata allows
Polars to answer some basic queries about missing values in a very efficient way, namely how many
values are missing and which ones are missing.

To determine how many values are missing from a column you can use the function `null_count`:

{{code_block('user-guide/expressions/missing-data','count',['null_count'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:count"
```

The function `null_count` can be called on a dataframe, a column from a dataframe, or on a series
directly. The function `null_count` is a cheap operation because the result is already known.

Polars uses something called a “validity bitmap” to know which values are missing in a series. The
validity bitmap is memory efficient as it is bit encoded. If a series has length $n$, then its
validity bitmap will cost $n / 8$ bytes. The function `is_null` uses the validity bitmap to
efficiently report which values are `null` and which are not:

{{code_block('user-guide/expressions/missing-data','isnull',['is_null'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:isnull"
```

The function `is_null` can be used on a column of a dataframe or on a series directly. Again, this
is a cheap operation because the result is already known by Polars.

??? info "Why does Polars waste memory on a validity bitmap?"

    It all comes down to a tradeoff.
    By using a bit more memory per column, Polars can be much more efficient when performing most operations on your columns.
    If the validity bitmap wasn't known, every time you wanted to compute something you would have to check each position of the series to see if a legal value was present or not.
    With the validity bitmap, Polars knows automatically the positions where your operations can be applied.
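
To make the $n / 8$ figure concrete, here is a toy bit-packed validity bitmap in plain Python. This only illustrates the idea and is not Polars's implementation. The helper names `pack_validity` and `is_valid` are hypothetical, and the least-significant-bit-first layout is assumed from the Arrow columnar format.

```python
def pack_validity(values):
    """Pack one bit per value into bytes (1 = present, 0 = null)."""
    bitmap = bytearray((len(values) + 7) // 8)  # n / 8 bytes, rounded up
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Read the validity bit for position i."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [10, None, 30, None, 50, 60, 70, 80, 90]
bitmap = pack_validity(values)

# Nine values fit in two bytes; counting nulls never touches the values.
valid_total = sum(bin(byte).count("1") for byte in bitmap)
null_total = len(values) - valid_total
print(len(bitmap), null_total)
```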

## Filling missing data

Missing data in a series can be filled in a few different ways. The first three use the function
`fill_null`, and the last uses the function `interpolate`:

- a literal of the correct data type;
- a Polars expression, such as replacing with values computed from another column;
- a strategy based on neighbouring values, such as filling forwards or backwards; and
- interpolation.

To illustrate how each of these methods works, we start by defining a simple dataframe with two
missing values in the second column:

{{code_block('user-guide/expressions/missing-data','dataframe2',['DataFrame'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:dataframe2"
```

### Fill with a specified literal value

You can fill the missing data with a specified literal value. This literal value will replace all of
the occurrences of the value `null`:

{{code_block('user-guide/expressions/missing-data','fill',['fill_null'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:fill"
```

However, this is just a special case of the general mechanism in which
[the function `fill_null` replaces missing values with the corresponding values from the result of a Polars expression](#fill-with-an-expression),
as seen next.

### Fill with an expression

In the general case, the missing data can be filled by extracting the corresponding values from the
result of a general Polars expression. For example, we can fill the second column with values taken
from the double of the first column:

{{code_block('user-guide/expressions/missing-data','fillexpr',['fill_null'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:fillexpr"
```

### Fill with a strategy based on neighbouring values

You can also fill the missing data by following a fill strategy based on the neighbouring values.
The two simpler strategies look for the first non-`null` value that comes immediately before or
immediately after the value `null` that is being filled:

{{code_block('user-guide/expressions/missing-data','fillstrategy',['fill_null'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:fillstrategy"
```

Other fill strategies, such as `"min"`, `"max"`, `"mean"`, `"zero"`, and `"one"`, are listed in the API docs.

### Fill with interpolation

Additionally, you can fill missing data with interpolation by using the function `interpolate`
instead of the function `fill_null`:

{{code_block('user-guide/expressions/missing-data','fillinterpolate',['interpolate'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:fillinterpolate"
```

## Not a Number, or `NaN` values

Missing data in a series is only ever represented by the value `null`, regardless of the data type
of the series. Columns with a floating point data type can sometimes have the value `NaN`, which
might be confused with `null`.

The special value `NaN` can be created directly:

{{code_block('user-guide/expressions/missing-data','nan',['DataFrame'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:nan"
```

And it might also arise as the result of a computation:

{{code_block('user-guide/expressions/missing-data','nan-computed',[])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:nan-computed"
```

!!! info

    In pandas, by default, a `NaN` value in an integer column causes the column to be cast to a float data type.
    This does not happen in Polars; instead, an exception is raised.

`NaN` values are considered to be a type of floating point data and are **not considered to be
missing data** in Polars. This means:

- `NaN` values are **not** counted with the function `null_count`; and
- `NaN` values are filled when you use the specialised function `fill_nan` but are **not**
  filled with the function `fill_null`.

Polars has the functions `is_nan` and `fill_nan`, which work in a similar way to the functions
`is_null` and `fill_null`. Unlike with missing data, Polars does not hold any metadata regarding the
`NaN` values, so the function `is_nan` entails actual computation.

One further difference between the values `null` and `NaN` is that numerical aggregating functions,
like `mean` and `sum`, skip the missing values when computing the result, whereas the value `NaN` is
considered for the computation and typically propagates into the result. If desired, you can avoid
this behavior by replacing the occurrences of the value `NaN` with the value `null`:

{{code_block('user-guide/expressions/missing-data','nanfill',['fill_nan'])}}

```python exec="on" result="text" session="user-guide/missing-data"
--8<-- "python/user-guide/expressions/missing-data.py:nanfill"
```

You can learn more about the value `NaN` in
[the section about floating point number data types](../concepts/data-types-and-structures.md#floating-point-numbers).