Python Pandas – Setting the Categories of a CategoricalIndex as Ordered

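In pandas, storing repeated labels as a CategoricalIndex can make data analysis more efficient: it typically reduces memory usage and can speed up lookups and computation. Sometimes the categories also need to be ordered, so that sorting and comparison follow a meaningful order. This article shows how to set the categories of a CategoricalIndex as ordered.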

Prerequisites

Before reading this article, you should be familiar with:

Basic Python syntax
Basic pandas operations
The basics of categorical data

Creating a CategoricalIndex

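A CategoricalIndex is backed by a Categorical array. pd.Categorical() converts a Series or a list into such an array, and pd.CategoricalIndex() accepts the same data, categories, and ordered arguments to build the index itself. For example: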

import pandas as pd

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog'])
cat_s = pd.Categorical(s)
print(cat_s)

Output:

[cat, dog, cat, bird, dog]
Categories (3, object): [bird, cat, dog]

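Here cat_s is a Categorical, the array type that backs a CategoricalIndex. Its categories are [bird, cat, dog], inferred from the data and sorted alphabetically.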
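Because the listing above builds a Categorical rather than the index type itself, the following minimal sketch shows how the same values could be turned into an actual CategoricalIndex. The variable names are illustrative and not part of the original example.

```python
import pandas as pd

values = ['cat', 'dog', 'cat', 'bird', 'dog']

# Build the index directly from the raw values.
idx = pd.CategoricalIndex(values)

# Or wrap an existing Categorical; its categories are carried over.
idx_from_cat = pd.CategoricalIndex(pd.Categorical(values))

print(type(idx).__name__)        # the index type, not the bare array
print(list(idx.categories))      # categories inferred from the data
print(idx.equals(idx_from_cat))  # both routes give the same index
```

Either route works; wrapping an existing Categorical is convenient when the categories and their order have already been set up elsewhere.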

Setting the category order

How do we mark the categories as ordered? Pass ordered=True when creating the categorical data, and use the categories parameter to give the order from lowest to highest. For example:

import pandas as pd

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog'])
cat_s = pd.Categorical(s, ordered=True, categories=['cat', 'dog', 'bird'])
print(cat_s)

Output:

[cat, dog, cat, bird, dog]
Categories (3, object): [cat < dog < bird]

Here cat_s is an ordered Categorical whose category order is [cat < dog < bird]. pd.CategoricalIndex() takes the same ordered and categories parameters.
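When a CategoricalIndex already exists, it does not have to be rebuilt to become ordered. The sketch below assumes an unordered index named idx (an illustrative name) and uses the as_ordered() and set_categories() methods of CategoricalIndex.

```python
import pandas as pd

idx = pd.CategoricalIndex(['cat', 'dog', 'cat', 'bird', 'dog'])

# as_ordered() keeps the current category order (bird, cat, dog)
# and only switches the ordered flag on.
ordered_idx = idx.as_ordered()

# set_categories() can reorder the categories and mark them
# as ordered in a single call.
custom_idx = idx.set_categories(['cat', 'dog', 'bird'], ordered=True)

print(ordered_idx.ordered, list(ordered_idx.categories))
print(custom_idx.ordered, list(custom_idx.categories))
```

Both methods return a new index and leave idx unchanged. as_ordered() is enough when the inferred order is already the desired one; set_categories() is the better fit when the order has to be stated explicitly.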

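Note that when categories is given, any value in the original data that is not among the listed categories is treated as a missing value and shown as NaN. No extra parameter is needed for this (pd.Categorical() has no dropna parameter), as the following example shows:

import pandas as pd

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog', 'fish'])
# 'fish' is not listed in categories, so it becomes NaN
cat_s = pd.Categorical(s, ordered=True, categories=['cat', 'dog', 'bird'])
print(cat_s)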

Output:

[cat, dog, cat, bird, dog, NaN]
Categories (3, object): [cat < dog < bird]

Here 'fish' is not a listed category, so cat_s contains one missing value, shown as NaN.
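To see how such values are stored, the sketch below inspects the integer codes and the missing-value mask, assuming the behaviour described above. The name cat_full and the choice to place 'fish' last are illustrative assumptions.

```python
import pandas as pd

values = ['cat', 'dog', 'cat', 'bird', 'dog', 'fish']
cat_s = pd.Categorical(values, ordered=True,
                       categories=['cat', 'dog', 'bird'])

# Values outside the listed categories get the sentinel code -1
# and are reported as missing.
print(cat_s.codes)
print(cat_s.isna())

# To keep 'fish', list it as a category up front
# (its position in the order is an assumption made for this sketch).
cat_full = pd.Categorical(values, ordered=True,
                          categories=['cat', 'dog', 'bird', 'fish'])
print(cat_full.isna().any())
```

Checking isna() right after construction is a cheap way to catch a misspelled or forgotten category before it silently turns data into missing values.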

Sorting

Once the categories are ordered, sorting follows the order given in categories rather than alphabetical order. Use the sort_values() method to sort the categorical data. For example:

import pandas as pd

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog'])
cat_s = pd.Categorical(s, ordered=True, categories=['cat', 'dog', 'bird'])
print(cat_s.sort_values())

Output:

[cat, cat, dog, dog, bird]
Categories (3, object): [cat < dog < bird]

Here cat_s is sorted, and the result follows the order [cat < dog < bird].

Note that missing values are placed last by default when sorting. To put them first, use the na_position parameter:

import pandas as pd

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog', None])
# None is stored as a missing value (NaN); no extra argument is required
cat_s = pd.Categorical(s, ordered=True, categories=['cat', 'dog', 'bird'])
print(cat_s.sort_values(na_position='first'))

Output:

[NaN, cat, cat, dog, dog, bird]
Categories (3, object): [cat < dog < bird]

Here cat_s is sorted with the missing value placed first.
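The same ordering applies when the categorical data is used as an index. The following sketch, with the illustrative names idx and scores, sorts a CategoricalIndex directly and then sorts a Series by that index.

```python
import pandas as pd

idx = pd.CategoricalIndex(['cat', 'dog', 'cat', 'bird', 'dog'],
                          categories=['cat', 'dog', 'bird'], ordered=True)
scores = pd.Series([1, 2, 3, 4, 5], index=idx)

# Sorting the index itself follows cat < dog < bird.
sorted_idx = idx.sort_values()
print(list(sorted_idx))

# sort_index() reorders the rows of the Series by the same category order.
sorted_scores = scores.sort_index()
print(list(sorted_scores.index))
```

In practice sort_index() is usually the more useful call, because it keeps each value attached to its label while the rows are reordered.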

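The same sorting is available on a Series: Series.sort_values() works directly once the Series has a categorical dtype. An object-dtype Series would be sorted alphabetically, so first convert it to the ordered categorical dtype with astype(), then sort: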

import pandas as pd

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog'])
cat_s = pd.Categorical(s, ordered=True, categories=['cat', 'dog', 'bird'])
sorted_s = s.astype(cat_s.dtype).sort_values()
print(sorted_s)

Output:

0     cat
2     cat
1     dog
4     dog
3    bird
dtype: category
Categories (3, object): [cat < dog < bird]
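An alternative to borrowing the dtype from an existing Categorical is to declare it once with CategoricalDtype and reuse it. A minimal sketch, where animal_type is an illustrative name:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the ordered dtype once so it can be shared by several Series.
animal_type = CategoricalDtype(categories=['cat', 'dog', 'bird'], ordered=True)

s = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog']).astype(animal_type)

# Descending order runs from the highest category (bird) to the lowest (cat).
sorted_desc = s.sort_values(ascending=False)
print(sorted_desc.tolist())
print(s.cat.ordered)
```

Sharing a single dtype object across columns also guarantees that their categories match, which matters when the columns are later compared with each other.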

Comparison

Once the categories are ordered, the data can be compared with the operators <, <=, >, and >= (== and != also work on unordered categories). For example:

import pandas as pd

s1 = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog'])
cat_s1 = pd.Categorical(s1, ordered=True, categories=['cat', 'dog', 'bird'])
s2 = pd.Series(['cat', 'bird', 'dog', 'dog', 'bird'])
cat_s2 = pd.Categorical(s2, ordered=True, categories=['cat', 'dog', 'bird'])

print(cat_s1 < cat_s2)

Output:

[False  True  True False  True]

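Here cat_s1 and cat_s2 are compared element by element. The result is a boolean NumPy array that gives, for each position, whether the element of cat_s1 ranks below the element of cat_s2 in the order [cat < dog < bird].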
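Ordered categories can also be compared against a single category value, and they support min() and max(). The sketch below illustrates this on the same data; the variable name mask is illustrative.

```python
import pandas as pd

cat_s = pd.Categorical(['cat', 'dog', 'cat', 'bird', 'dog'],
                       ordered=True, categories=['cat', 'dog', 'bird'])

# Compare every element with one category; the scalar must be
# one of the listed categories.
mask = cat_s > 'cat'
print(mask)

# min() and max() follow the category order, not alphabetical order.
print(cat_s.min(), cat_s.max())
```

A mask like this can be used to filter rows, for example to keep only the entries that rank above a given category.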

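Note that two Categoricals can be compared only when they have the same categories. A value such as 'fish' that is missing from categories simply becomes NaN and does not cause an error. The error arises when the category lists themselves differ, in which case a TypeError is raised:

import pandas as pd

s1 = pd.Series(['cat', 'dog', 'cat', 'bird', 'dog'])
cat_s1 = pd.Categorical(s1, ordered=True, categories=['cat', 'dog', 'bird'])
s2 = pd.Series(['cat', 'bird', 'dog', 'dog', 'fish'])
# cat_s2 lists an extra category, so its categories differ from cat_s1
cat_s2 = pd.Categorical(s2, ordered=True, categories=['cat', 'dog', 'bird', 'fish'])

print(cat_s1 < cat_s2)

Output (the exact wording may vary between pandas versions):

TypeError: Categoricals can only be compared if 'categories' are the same.

Here cat_s2 has the extra category 'fish', so its categories differ from those of cat_s1 and the comparison fails.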
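A mismatch in categories can be handled by catching the TypeError and realigning one side before comparing again. The following is a sketch under the assumption that dropping the extra category is acceptable; left, right, and aligned are illustrative names.

```python
import pandas as pd

left = pd.Categorical(['cat', 'dog'], ordered=True,
                      categories=['cat', 'dog', 'bird'])
right = pd.Categorical(['dog', 'bird'], ordered=True,
                       categories=['cat', 'dog', 'bird', 'fish'])

try:
    left < right
    raised = False
except TypeError as exc:
    # The category lists differ, so pandas refuses to compare.
    raised = True
    print(exc)

# Give `right` the same categories as `left`; any value outside
# them would become NaN.
aligned = right.set_categories(left.categories, ordered=True)
result = left < aligned
print(result)
```

Realigning with set_categories() is a deliberate choice here: it makes the loss of any extra category explicit instead of hiding it inside the comparison.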

Conclusion

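In pandas, converting label data to a CategoricalIndex can make data analysis more efficient, and marking its categories as ordered makes sorting and comparison meaningful. Use ordered=True, together with categories to fix the order, to make the categories ordered; use sort_values() to sort and the comparison operators to compare. Two categoricals must have the same categories to be compared; otherwise a TypeError is raised.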