Python 3 – Regular Expressions

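Python ships with a powerful regular expression module for working with strings. Regular expressions are commonly used to match, search, and replace text. A regular expression can be compiled into a pattern object, which is then used to match strings. In Python 3, regular expressions are provided by the re module.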

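Basic functions of the re module

In Python 3, the re module provides many functions for working with regular expressions. Some of the basic ones are:

compile(pattern, flags=0): compiles a regular expression into a pattern object.
match(pattern, string, flags=0): tries to match the regular expression at the start of the string; returns a Match object on success and None otherwise.
search(pattern, string, flags=0): scans the string and returns a Match object for the first match, or None if there is none.
findall(pattern, string, flags=0): returns a list of all non-overlapping matches in the string; if the pattern contains groups, the list holds the groups instead of the whole match.
finditer(pattern, string, flags=0): like findall, but returns an iterator of Match objects instead of a list.

Here are some examples: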

import re

# compile(): build a reusable pattern object
pattern = re.compile(r'([a-z]+)-([a-z]+)')
result = pattern.match('hello-world')
print(result.group(0))  # hello-world
print(result.group(1))  # hello
print(result.group(2))  # world

# match(): match at the start of the string
result = re.match(r'([a-z]+)-([a-z]+)', 'hello-world')
print(result.group(0))  # hello-world
print(result.group(1))  # hello
print(result.group(2))  # world

# search(): find the first match anywhere in the string
result = re.search(r'([a-z]+)-([a-z]+)', 'The quick-brown fox jumps over the lazy dog')
print(result.group(0))  # quick-brown
print(result.group(1))  # quick
print(result.group(2))  # brown

# findall(): with groups in the pattern, each match is a tuple of groups
result = re.findall(r'([a-z]+)-([a-z]+)', 'hello-world is a great idea')
print(result)  # [('hello', 'world')]

# finditer(): iterate over Match objects
result = re.finditer(r'([a-z]+)-([a-z]+)-([a-z]+)', 'hello-world-python')
for match in result:
    print(match.group())  # hello-world-python
    print(match.groups())  # ('hello', 'world', 'python')
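
The examples above call group() directly on the result, which is safe only because every pattern happens to match. The following is a minimal sketch of how match() and search() differ on the same input, and of checking for None before using a result. The sample string and the helper name describe are made up for illustration.

```python
import re

# Compile once and reuse for both calls (the sample text is illustrative only).
pattern = re.compile(r'([a-z]+)-([a-z]+)')
text = 'say hello-world'


def describe(result):
    """Return the matched text, or a placeholder when there is no match."""
    return result.group(0) if result is not None else 'no match'


# match() only tries position 0, where 'say' is followed by a space, not a hyphen.
at_start = describe(pattern.match(text))

# search() scans forward and returns the first occurrence it finds.
anywhere = describe(pattern.search(text))

print(at_start)  # no match
print(anywhere)  # hello-world
```

When a pattern is used many times, compiling it once, as here, avoids repeating the pattern string at every call site.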

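Regular expression syntax

A regular expression is a pattern made up of ordinary characters and special characters; it can be used to match, search, and replace text. The table below lists some basic elements of the syntax: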

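Character	Description
`.`	Matches any character except a newline
`^`	Matches the start of the string
`$`	Matches the end of the string (or just before a trailing newline)
`*`	Matches the preceding element 0 or more times
`+`	Matches the preceding element 1 or more times
`?`	Matches the preceding element 0 or 1 times
`{m,n}`	Matches the preceding element from m to n times
`[abc]`	Matches any one of the characters inside the brackets
`(ab)`	Matches the expression inside the parentheses and captures it as a group
`\d`	Matches any digit
`\D`	Matches any non-digit character
`\w`	Matches any word character (letters, digits, and the underscore)
`\W`	Matches any non-word character
`\s`	Matches any whitespace character (space, tab, newline, and so on)
`\S`	Matches any non-whitespace character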

Here are some examples:

import re

# Match an email address (raw strings keep backslashes such as \. intact)
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
result = pattern.match('python@gmail.com')
print(result.group(0))  # python@gmail.com

# Match an IPv4 address: four dot-separated numbers, each from 0 to 255
pattern = re.compile(r'^([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                     r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                     r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                     r'([01]?\d\d?|2[0-4]\d|25[0-5])$')
result = pattern.match('192.168.1.1')
print(result.group(0))  # 192.168.1.1

# Match an opening HTML tag
pattern = re.compile(r'<[a-z]+>')
result = pattern.search('<p>Hello, world!</p>')
print(result.group(0))  # <p>
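
To see a few of the table's elements in isolation, here is a small sketch that applies anchors, a {m,n} quantifier, a character set, and the \w and \S classes to short sample strings. All of the inputs are invented for illustration.

```python
import re

# ^ and $ anchor the whole string; \d{2,4} means two to four digits.
year_ok = re.match(r'^\d{2,4}$', '2024') is not None
year_bad = re.match(r'^\d{2,4}$', '20245') is not None

# \w+ : runs of letters, digits, or underscores.
words = re.findall(r'\w+', 'snake_case, CamelCase 42')

# [abc]+ : one or more characters taken from the set a, b, c.
runs = re.findall(r'[abc]+', 'xxabcayyb')

# \S+ : runs of non-whitespace, whatever kind of whitespace separates them.
fields = re.findall(r'\S+', 'a  b\tc\nd')

print(year_ok, year_bad)  # True False
print(words)              # ['snake_case', 'CamelCase', '42']
print(runs)               # ['abca', 'b']
print(fields)             # ['a', 'b', 'c', 'd']
```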

Replacement

Regular expressions can also be used to replace text. In Python 3, the re module provides the sub() function for this.

import re

pattern = re.compile('(blue|white|red)')
result = pattern.sub('color', 'blue socks and red shoes')
print(result)  # color socks and color shoes
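
Beyond a fixed replacement string, sub() also accepts group references such as \1 in the replacement, a function that computes each replacement, and a count argument that limits how many replacements are made. A brief sketch follows; the helper name shout and the sample strings are assumptions chosen for illustration.

```python
import re

# Swap the two words around the hyphen using the group references \1 and \2.
swapped = re.sub(r'([a-z]+)-([a-z]+)', r'\2-\1', 'hello-world')


def shout(match):
    """Compute the replacement from the Match object: upper-case the match."""
    return match.group(0).upper()


upper = re.sub(r'blue|white|red', shout, 'blue socks and red shoes')

# count=1 replaces only the first occurrence.
first_only = re.sub(r'blue|white|red', 'color', 'blue socks and red shoes', count=1)

print(swapped)     # world-hello
print(upper)       # BLUE socks and RED shoes
print(first_only)  # color socks and red shoes
```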

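Flags

Flags (also called modifiers) change how a pattern is matched. The re module supports the flags below, and several of them can be combined with the | operator. Some of the flags are: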

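Flag	Description
`re.I`	Case-insensitive matching
`re.L`	Locale-dependent matching; in Python 3 it can only be used with bytes patterns
`re.M`	Multiline matching: `^` and `$` also match at the start and end of each line
`re.S`	Makes `.` match any character, including a newline
`re.U`	Unicode matching; this is already the default for str patterns in Python 3, so the flag is redundant there
`re.X`	Verbose mode: ignores whitespace and comments inside the pattern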

Here are some examples:

import re

# Case-insensitive matching
pattern = re.compile('hello', re.I)
result = pattern.match('Hello, world!')
print(result.group())  # Hello

# Multiline matching: ^ matches at the start of every line
pattern = re.compile('^hello', re.M)
result = pattern.findall('hello world\nhello python\nhi there')
print(result)  # ['hello', 'hello']

# re.S lets . match newlines as well
pattern = re.compile('.+', re.S)
result = pattern.match('hello\nworld')
print(repr(result.group()))  # 'hello\nworld'

# Verbose mode: whitespace and comments inside the pattern are ignored
pattern = re.compile(r'''
                       # match a mobile phone number
                       (0|86|17951)?   # optional prefix
                       (\d{11})        # the 11-digit number
                       ''', re.X)
result = pattern.match('13812345678')
print(result.group(2))  # 13812345678
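
Each example above uses a single flag. As a small sketch of combining them, the following joins re.I and re.M with the | operator, then writes the same flags inline as (?im) at the start of the pattern. The sample text is made up for illustration.

```python
import re

text = 'Hello world\nHELLO python\nhi there'

# Combine flags with | : ignore case, and let ^ match at the start of each line.
pattern = re.compile(r'^hello', re.I | re.M)
found = pattern.findall(text)

# The same two flags written inline at the start of the pattern.
inline = re.findall(r'(?im)^hello', text)

print(found)   # ['Hello', 'HELLO']
print(inline)  # ['Hello', 'HELLO']
```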

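Conclusion

Python 3 provides the re module for regular expression operations. Basic functions such as compile(), match(), and search() cover basic matching. The regular expression syntax offers many basic elements for matching, searching, and replacing text. For replacement, we can use the sub() function. Besides the common I, M, and S flags, there are other flags for different scenarios. Finally, regular expressions are powerful but easy to get wrong: match() and search() return None when nothing matches, so calling group() on the result without a check raises an AttributeError, and every pattern should be tested against the inputs it is expected to handle.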
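
One way to guard against the failure cases mentioned above is a small wrapper that checks for None and catches re.error, the exception the re module raises for a malformed pattern. This is only a sketch: first_match is a hypothetical helper, and treating an invalid pattern as "no match" is a design choice that may not suit every program.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None.

    Hypothetical helper: an invalid pattern is treated like "no match".
    """
    try:
        result = re.search(pattern, text)
    except re.error:
        # Raised when the pattern itself is malformed, e.g. an unclosed group.
        return None
    return result.group(0) if result is not None else None


print(first_match(r'\d+', 'order 66'))   # 66
print(first_match(r'\d+', 'no digits'))  # None
print(first_match(r'(\d+', 'order 66'))  # None (unclosed parenthesis)
```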