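Regex Replacement in pandas

Cleaning and replacing text is a routine part of data processing, and regular expressions are one of the most powerful tools for the job. pandas provides many methods for working with text data, including regex-based replacement. This article shows how to perform regex replacement in pandas.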

1. Import the required libraries

Before doing any regex replacement, import pandas.

import pandas as pd

2. Create sample data

To demonstrate regex replacement, first create a DataFrame that contains text data.

data = {'text': ['apple', 'banana', 'cherry', '1apple2', 'grape123']}
df = pd.DataFrame(data)
print(df)

Output:

       text
0     apple
1    banana
2    cherry
3   1apple2
4  grape123
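
Before replacing anything, it can help to check which rows a pattern actually touches. The sketch below is an illustration rather than part of the original walkthrough: it rebuilds the same sample data and uses str.contains() and str.findall() to preview the digit matches. The variable names has_digit and digit_runs are chosen here for the example.

```python
import pandas as pd

df = pd.DataFrame({'text': ['apple', 'banana', 'cherry', '1apple2', 'grape123']})

# Boolean mask: True where the value contains at least one digit.
has_digit = df['text'].str.contains(r'\d', regex=True)

# List every run of digits per row, i.e. what a replacement of \d+ would hit.
digit_runs = df['text'].str.findall(r'\d+')

print(df[has_digit])
print(digit_runs.tolist())
```

Only the last two rows contain digits, so only those rows would change under a digit-based replacement.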

3. Replacing with a regular expression

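In pandas, the str.replace() method performs regex replacement. The simple example below replaces every run of digits in each string with 'number'; the rest of the string is left unchanged.

# A raw string (r'...') keeps Python from interpreting the backslash in \d.
df['text'] = df['text'].str.replace(r'\d+', 'number', regex=True)
print(df)

Output:

                text
0              apple
1             banana
2             cherry
3  numberapplenumber
4        grapenumber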

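In the example above, \d+ is the regular expression; it matches one or more consecutive digits. 'number' is the replacement text for each match. regex=True tells pandas to treat the pattern as a regular expression rather than a literal string; since pandas 2.0 the default is regex=False, so the argument has to be passed explicitly.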
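
To make the distinction between replacing the matched part and replacing the whole value concrete, here is a minimal sketch on the same sample strings. It is an illustration, not the article's own code: the use of Series.mask() together with str.contains() to overwrite entire values is one possible approach among several, and the names partial, whole and literal are chosen for this example.

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', '1apple2', 'grape123'])

# Replace only the matched digit runs; the surrounding text is kept.
partial = s.str.replace(r'\d+', 'number', regex=True)

# Replace the whole value whenever it contains at least one digit.
whole = s.mask(s.str.contains(r'\d', regex=True), 'number')

# With regex=False the pattern is a literal string, so the characters
# backslash, 'd', '+' are searched for as-is and nothing matches here.
literal = s.str.replace(r'\d+', 'number', regex=False)

print(partial.tolist())
print(whole.tolist())
print(literal.tolist())
```

The first variant turns '1apple2' into 'numberapplenumber', the second turns it into 'number', and the third leaves every value unchanged.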

4. More complex regex replacement

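Beyond simple substitutions, more involved replacements are possible. The example below replaces every run of digits with 'number' and then converts lowercase letters to uppercase. Because the inserted text 'number' is itself lowercase, it is uppercased as well.

import re

def custom_replace(text):
    # Replace each run of digits with the literal text 'number'.
    text = re.sub(r'\d+', 'number', text)
    # Uppercase every run of lowercase ASCII letters, including the inserted 'number'.
    text = re.sub(r'[a-z]+', lambda x: x.group().upper(), text)
    return text

df['text'] = df['text'].apply(custom_replace)
print(df)

Output:

                text
0              APPLE
1             BANANA
2             CHERRY
3  NUMBERAPPLENUMBER
4        GRAPENUMBER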

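In the example above, the first re.sub call replaces each run of digits with 'number', and the second converts lowercase letters to uppercase by passing a function as the replacement. Defining a custom function such as custom_replace and passing it to apply() makes it possible to carry out more complex replacements. Note that step 3 already overwrote the column, so no digits remain when this code runs on the same DataFrame; applied to the original sample data, custom_replace produces the same result.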
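
The same transformation can also be sketched with pandas string methods alone, which has the side benefit that missing values pass through instead of raising an error inside re.sub. This is an alternative illustration, not the article's method: the helper name shout and the sample Series with a None entry are assumptions made for the example. Note that str.upper() uppercases all letters, whereas the [a-z]+ pattern only touches ASCII lowercase letters; for this data the results coincide.

```python
import pandas as pd

# Sample data with one missing value (added here for illustration).
s = pd.Series(['apple', '1apple2', None, 'grape123'])

# Chain two vectorized steps; the missing value stays missing.
chained = s.str.replace(r'\d+', 'number', regex=True).str.upper()

# str.replace also accepts a callable when regex=True. It receives each
# match object, like the lambda passed to re.sub in custom_replace.
def shout(match):
    return match.group().upper()

via_callable = (
    s.str.replace(r'\d+', 'number', regex=True)
     .str.replace(r'[a-z]+', shout, regex=True)
)

print(chained.tolist())
print(via_callable.tolist())
```

Calling custom_replace through apply() on this Series would fail on the None entry, because re.sub expects a string; the str accessor avoids that without an explicit check.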