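How to Match Words in Python Using Regular Expressions

Regular expressions and Python's 're' module make powerful text processing possible. A regular expression, often called a regex, lets us identify, search for, and manipulate specific patterns in strings. One task that comes up often is matching a particular word in a piece of text. This article shows how to find and match words in strings using regular expressions in Python. It works through several code examples, each accompanied by a step-by-step explanation.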

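Matching a Simple Word

Example

In this first example, we start by importing the 're' module, which provides regular expression support in Python. Our goal is to match the word "fox" in a given text.
To build the pattern, we pass the word through re.escape() so that any special characters in it are treated as literal characters. This avoids unexpected behavior if the word contains regex metacharacters.
The pattern r"\b" + re.escape(word_to_match) + r"\b" uses the \b word-boundary anchor on both sides to match "fox" as a whole word. \b matches only between a word character and a non-word character, or at the start or end of the string, so "fox" will not match inside a longer word such as "foxes".
We then call re.search() to find the first occurrence of the word in the text. If a match is found, we print the matched word with match.group(); otherwise we print "Word not found.".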

import re

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# The word we want to match
word_to_match = "fox"

# Regular expression pattern to match the word
pattern = r"\b" + re.escape(word_to_match) + r"\b"

# Find the word in the text
match = re.search(pattern, text)

# Output the match
if match:
    print("Word found:", match.group())
else:
    print("Word not found.")

Output

Word found: fox
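
To make the roles of \b and re.escape() concrete, here is a minimal sketch. The helper name contains_word and the sample strings are hypothetical, chosen only for illustration; they are not part of the original example.

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word (hypothetical helper)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None

# \b keeps "fox" from matching inside the longer word "foxes"
print(contains_word("fox", "The quick brown fox jumps."))   # True
print(contains_word("fox", "Two foxes crossed the road."))  # False

# Without re.escape(), the dot in "3.14" is a wildcard and also matches "3x14"
print(re.search(r"\b3.14\b", "code 3x14") is not None)      # True
print(contains_word("3.14", "code 3x14"))                   # False
```

The last two lines show why escaping matters: the unescaped pattern accepts any character where the dot stands, while the escaped one accepts only a literal dot.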

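Case-Insensitive Word Matching

Example

In this snippet, the sample text mentions the Python programming language. Our goal is to match the word "python" without regard to case, so the pattern should find it whether it appears in the text as "Python" or "python".
To make the match case-insensitive, we pass the re.IGNORECASE flag as the third argument to re.search(). The flag tells the regex engine to ignore letter case when searching.
The rest of the code follows the previous example: we build the pattern with word-boundary anchors, use re.escape() to match the word safely, then run the search and print the result.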

import re

# Sample text
text = "The Python programming language is versatile and powerful."

# The word we want to match (case-insensitive)
word_to_match = "python"

# Regular expression pattern for case-insensitive word matching
pattern = r"\b" + re.escape(word_to_match) + r"\b"

# Find the word in the text (case-insensitive)
match = re.search(pattern, text, re.IGNORECASE)

# Output the match
if match:
    print("Word found:", match.group())
else:
    print("Word not found.")

Output

Word found: Python
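
When the same word is searched for in several strings, one option is to compile the pattern once with the flag attached. The sketch below assumes this setup; the variable names and sample strings are illustrative only.

```python
import re

# Compile once; the flag is stored with the pattern object
word_pattern = re.compile(r"\b" + re.escape("python") + r"\b", re.IGNORECASE)

for sample in ["Python is fun", "PYTHON rocks", "pythonic code"]:
    match = word_pattern.search(sample)
    # match.group() returns the text as it appears in the sample, case preserved
    print(sample, "->", match.group() if match else None)
```

The first two samples match regardless of case, while "pythonic" does not, because the trailing \b requires the word to end right after "python".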

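Matching Words with Variant Spellings

Example

In this example, the sample text contains two spellings of the same word, "color" and "colour". Our task is to match both spellings, ignoring case.
To match variant spellings, we build a pattern that uses the | (pipe) character as an OR operator, which lets us list the alternative spellings. Each spelling is passed through re.escape() on its own before the spellings are joined with |. Escaping the combined string "color|colour" would also escape the |, turning it into a literal character, and the pattern would then find nothing in this text. We also pass re.IGNORECASE so that the match is case-insensitive.
The pattern r"\b(?:" + "|".join(...) + r")\b" wraps the alternatives in a non-capturing group between word-boundary anchors, so we match whole words only and not parts of longer words.
We use re.findall() to find every occurrence of either spelling and store the results in the matches variable. Finally, we print the matched words joined by a comma and a space.

import re

# Sample text with variant spellings of a word
text = "Color or colour, which one do you prefer?"

# The variant spellings we want to match
words_to_match = ["color", "colour"]

# Escape each spelling separately, then join them with | so it acts as OR
pattern = r"\b(?:" + "|".join(re.escape(word) for word in words_to_match) + r")\b"

# Find all occurrences in the text (case-insensitive)
matches = re.findall(pattern, text, re.IGNORECASE)

# Output the matches
if matches:
    print("Words found:", ", ".join(matches))
else:
    print("Word not found.")

Output

Words found: Color, colour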
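
One thing to keep in mind with alternation is that re.escape() treats | as a special character and escapes it, so a string such as "color|colour" passed to re.escape() no longer works as an OR. The sketch below shows two ways around this, under the assumption that the variants are available as a list. The helper name build_variant_pattern is hypothetical.

```python
import re

def build_variant_pattern(variants):
    """Compile a whole-word, case-insensitive pattern for a list of spellings."""
    # Escape each variant separately so the "|" separators stay regex operators
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

text = "Color or colour, which one do you prefer?"

# Option 1: explicit list of spellings
print(build_variant_pattern(["color", "colour"]).findall(text))  # ['Color', 'colour']

# Option 2: an optional character, since the spellings differ only by "u"
print(re.findall(r"\bcolou?r\b", text, re.IGNORECASE))           # ['Color', 'colour']

# What re.escape() does to the combined string: the pipe becomes a literal
print(re.escape("color|colour"))                                 # color\|colour
```

The non-capturing group (?:...) is used so that re.findall() returns the full matches rather than the contents of a capture group.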

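Matching Words with Prefixes or Suffixes

Example

In this second-to-last example, the sample text contains a word that carries a suffix. Our goal is to match the word "uncomplete" whether or not it has any prefix or suffix attached.
To do this, we build a pattern that places \w* (zero or more word characters) on both sides of the word we want to match. The re.IGNORECASE flag makes the match case-insensitive.
The pattern r"\b\w*" + re.escape(word_to_match) + r"\w*\b" combines word-boundary anchors with \w* to match the entire word, even when extra characters come before or after "uncomplete".
We use re.findall() to find all such words in the text and store them in the matches variable. Finally, we print the matched words joined by a comma and a space.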

import re

# Sample text with words having prefixes or suffixes
text = "The project is uncompleted, but they're working on it."

# The word with prefixes or suffixes we want to match
word_to_match = "uncomplete"

# Regular expression pattern to match word with prefixes or suffixes
pattern = r"\b\w*" + re.escape(word_to_match) + r"\w*\b"

# Find the word in the text
matches = re.findall(pattern, text, re.IGNORECASE)

# Output the matches
if matches:
    print("Words found:", ", ".join(matches))
else:
    print("Word not found.")

Output

Words found: uncompleted
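
The same idea can be narrowed to prefixes only or suffixes only by dropping one of the \w* parts. The following sketch uses a hypothetical sample sentence and the stem "complete" to compare the three variations.

```python
import re

text = "An incomplete draft was completed after the completion date."
stem = re.escape("complete")

# Prefix and/or suffix allowed
print(re.findall(r"\b\w*" + stem + r"\w*\b", text))  # ['incomplete', 'completed']

# Prefix only: the word must end with the stem
print(re.findall(r"\b\w*" + stem + r"\b", text))     # ['incomplete']

# Suffix only: the word must start with the stem
print(re.findall(r"\b" + stem + r"\w*\b", text))     # ['completed']
```

Note that "completion" is not matched by any of the patterns, because it does not contain the exact character sequence "complete". The technique works on literal substrings, not on word roots.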

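Matching Words of Variable Length

Example

In this last example, the sample text contains the word "sun" among words of different lengths. Our task is to match "sun" as a whole word wherever it appears in the text, regardless of the lengths of the words around it.
To do this, we build a pattern with the \b word-boundary anchor on both sides so that only the complete word matches. As before, we use re.escape() to handle any special characters in the word safely, and re.IGNORECASE for case-insensitive matching.
The pattern r"\b" + re.escape(word_to_match) + r"\b" matches "sun" as a complete word at any position in the text.
We use re.findall() to find every occurrence of "sun" in the text, wherever it appears. The results are stored in the matches variable, and we print them joined by a comma and a space.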

import re

# Sample text with words of varying lengths
text = "The sun sets early in summer, but late in winter."

# The word we want to match with variable lengths
word_to_match = "sun"

# Regular expression pattern to match word with variable lengths
pattern = r"\b" + re.escape(word_to_match) + r"\b"

# Find the word in the text
matches = re.findall(pattern, text, re.IGNORECASE)

# Output the matches
if matches:
    print("Words found:", ", ".join(matches))
else:
    print("Word not found.")
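
If the position of each match is also of interest, re.finditer() can be used in place of re.findall(). This is a minimal sketch with a hypothetical sample sentence, not part of the original example.

```python
import re

text = "The sun rose on Sunday; the Sun set late."
pattern = re.compile(r"\b" + re.escape("sun") + r"\b", re.IGNORECASE)

# finditer() yields match objects, which carry the position of each hit
hits = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
print(hits)  # [('sun', 4, 7), ('Sun', 28, 31)]
```

"Sunday" is skipped because the closing \b requires the word to end right after "sun".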