The difflib Module in Python | 极客笔记

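This tutorial covers the difflib module in Python. We will discuss what the module does and work through examples based on its classes and functions.

So, let's get started.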

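Understanding the Python difflib Module

difflib is a module in Python's standard library that provides classes and functions for comparing sequences. It reports the results of these comparisons in a human-readable format, using deltas to show the differences compactly.

difflib is most often used to compare sequences of strings, but it can compare sequences of other types as long as their elements are hashable. An object is hashable if its hash value never changes during its lifetime and it can be compared for equality with other objects.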
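The point about hashable elements can be illustrated with a minimal sketch that compares two lists of integers instead of strings. The variable names and values are made up for illustration.

```python
from difflib import SequenceMatcher

# Two sequences of hashable, non-string elements (plain integers).
old_ids = [1, 2, 3, 4, 5]
new_ids = [1, 2, 4, 5, 6]

# The first positional argument is isjunk; None means "no junk filter".
matcher = SequenceMatcher(None, old_ids, new_ids)
similarity = matcher.ratio()

# 4 elements match (1, 2, 4, 5) out of 10 in total: 2 * 4 / 10
print(similarity)  # 0.8
```

Tuples, frozensets, and other hashable objects work the same way; unhashable elements such as nested lists do not.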

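The most commonly used classes in difflib are Differ and SequenceMatcher. The module also provides helper classes and functions for more specific tasks. The following sections cover some of them.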

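Understanding the SequenceMatcher Class

Let's start with one of the more self-explanatory parts of difflib: the SequenceMatcher class. SequenceMatcher compares two sequences, here two strings, and exposes data describing how similar they are. We will call its ratio() method, which returns the similarity as a float between 0 and 1. Here is an example:

Example: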

# importing the SequenceMatcher class from the difflib module
from difflib import SequenceMatcher

# defining the strings
str_1 = "Welcome to Javatpoint"
str_2 = "Welcome to Python tutorial"

# creating a SequenceMatcher instance for the two strings
my_seq = SequenceMatcher(a=str_1, b=str_2)

# printing the result
print("First String:", str_1)
print("Second String:", str_2)
print("Sequence Matched:", my_seq.ratio())

Output:

First String: Welcome to Javatpoint
Second String: Welcome to Python tutorial
Sequence Matched: 0.5106382978723404

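Explanation:

In the snippet above, we first import the SequenceMatcher class from the difflib module. We then define the two strings we want to compare and create a SequenceMatcher instance, passing the strings as the arguments a and b. The constructor's full signature is SequenceMatcher(isjunk=None, a='', b='', autojunk=True), so a and b are not its first positional parameter.

Because the first positional parameter is isjunk, we pass the strings by keyword: SequenceMatcher(a = str_1, b = str_2). The positional equivalent is SequenceMatcher(None, str_1, str_2).

With the instance created, we print the value returned by the ratio() method mentioned earlier. It computes 2.0 * M / T, where M is the number of matching elements and T is the total number of elements in both sequences, and returns the result as a float. Here that is 2 * 12 / 47, or about 0.51. That is all it takes to compare two simple strings and get a similarity score.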
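As a rough check of how the score is derived, the sketch below recomputes it by hand from the matching blocks reported by get_matching_blocks(), using the same two strings as the example. The names matched, total, and manual_ratio are introduced here for illustration only.

```python
from difflib import SequenceMatcher

str_1 = "Welcome to Javatpoint"
str_2 = "Welcome to Python tutorial"
matcher = SequenceMatcher(a=str_1, b=str_2)

# Sum the sizes of all matching blocks; the final block is a
# zero-length sentinel, so it does not affect the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(str_1) + len(str_2)

# ratio() is defined as 2.0 * M / T.
manual_ratio = 2.0 * matched / total

print(matched, total)                   # 12 47
print(manual_ratio == matcher.ratio())  # True
```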

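Note: ratio() is one of several methods of the SequenceMatcher class. The official Python documentation describes the others, such as quick_ratio(), get_matching_blocks(), and get_opcodes(), which support other sequence operations.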
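To give a feel for those other methods, here is a small sketch using get_opcodes() and quick_ratio() on two short strings chosen for illustration.

```python
from difflib import SequenceMatcher

matcher = SequenceMatcher(a="abcd", b="bcde")

# get_opcodes() lists the steps that turn a into b as
# (tag, i1, i2, j1, j2) tuples over a[i1:i2] and b[j1:j2].
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(matcher.a[i1:i2]), "->", repr(matcher.b[j1:j2]))

# quick_ratio() is a cheaper upper bound on ratio().
print(matcher.quick_ratio() >= matcher.ratio())
```

For these inputs the opcodes are a delete of 'a', an equal run 'bcd', and an insert of 'e'.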

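Understanding the Differ Class

The Differ class complements SequenceMatcher: it takes sequences of text lines and reports the differences between them. What sets Differ apart is that it produces deltas, which makes the differences easier for people to spot.

Each line of the output starts with a two-character code. A line that appears only in the second sequence is prefixed with '+ '.

As you might guess, a line that appears only in the first sequence is prefixed with '- '.

A line common to both sequences is prefixed with two spaces. A line beginning with '? ' appears in neither input; it is a hint that marks where two similar lines differ within the line. Note that Differ itself has no ratio() method; that method belongs to SequenceMatcher.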
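The prefix codes can be seen together in a minimal sketch that compares two short lists of lines. The fruit names are sample data invented for illustration.

```python
from difflib import Differ

before = ["apple\n", "banana\n", "cherry\n"]
after = ["apple\n", "blueberry\n", "cherry\n", "date\n"]

# compare() returns a generator; list() materializes the delta lines.
delta = list(Differ().compare(before, after))
for line in delta:
    print(line, end="")

# Printed delta:
#   apple
# - banana
# + blueberry
#   cherry
# + date
```

No '? ' hint lines appear here because "banana" and "blueberry" are too dissimilar for Differ to treat them as edited versions of the same line.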

Consider the following example to see how the Differ class works.

Example:

# importing the Differ class from the difflib module
from difflib import Differ

# defining the strings
str_1 = "They would like to order a soft drink"
str_2 = "They would like to order a corn pizza"

# splitting each string into a list of lines
lines_str1 = str_1.splitlines()
lines_str2 = str_2.splitlines()

# creating a Differ instance and calling its compare() method
dif = Differ()
my_diff = dif.compare(lines_str1, lines_str2)

# printing the results
print("First String:", str_1)
print("Second String:", str_2)
print("Difference between the Strings")
print('\n'.join(my_diff))

Output:

First String: They would like to order a soft drink
Second String: They would like to order a corn pizza
Difference between the Strings
- They would like to order a soft drink
?                            ^ ^^ ^^ ^^

+ They would like to order a corn pizza
?                            ^ ^^ ^ ^^^

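Explanation:

In the snippet above, we import the Differ class from the difflib module and define the two strings to compare. We then call the splitlines() method on each string to split it into a list of lines.

Syntax: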

lines_str1 = str_1.splitlines()
lines_str2 = str_2.splitlines()

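Differ.compare() expects two sequences of lines. Splitting the strings first lets us compare them line by line; passing the raw strings would compare them character by character instead.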
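The difference between the two modes can be sketched with a pair of two-character strings, chosen here only to keep the output short.

```python
from difflib import Differ

differ = Differ()

# Raw strings are sequences of characters, so each "line" is one character.
char_level = list(differ.compare("ab", "ac"))
print(char_level)  # ['  a', '- b', '+ c']

# Lists of lines are compared one whole line at a time.
line_level = list(differ.compare(["ab"], ["ac"]))
print(line_level)  # ['- ab', '+ ac']
```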

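After creating a Differ instance, we call its compare() method with the two lists of lines as arguments and store the result, a generator of delta lines, in another variable.

Syntax: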

my_diff = dif.compare(lines_str1, lines_str2)

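Finally, we join the lines in my_diff with newline characters and pass the result to print(), which makes the output easier to read. The blank lines in the output appear because each '? ' hint line already ends with a newline of its own.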

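Understanding the get_close_matches Function

difflib provides another simple but powerful tool: the get_close_matches function. It does what its name says: it takes a target string and a list of candidates, and returns the candidates that most closely match the target. Its signature looks like this:

Syntax: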

get_close_matches(word, possibilities, n=3, cutoff=0.6)

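As the syntax above shows, get_close_matches() takes four parameters, but only the first two are required.

The first parameter is the word to look up; the function returns the entries that are similar to it. The second is the list of candidate strings, passed either as a variable or as a literal list. The third, n, caps the number of matches returned. The last, cutoff, sets how similar a candidate must be to the word in order to be returned.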
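To make the cutoff concrete, the sketch below scores each candidate in a sample word list with SequenceMatcher.ratio() and keeps those at or above a threshold. It illustrates the idea only and is not difflib's own implementation; the names scores and passing are introduced here for the example.

```python
from difflib import SequenceMatcher

word = 'mas'
candidates = ['master', 'mask', 'duck', 'cow', 'mass', 'massive', 'python', 'butter']

# Score every candidate against the word, rounded for display.
scores = {c: round(SequenceMatcher(None, c, word).ratio(), 3) for c in candidates}

# Keep candidates at or above the cutoff, best score first;
# ties are ordered by the word itself, in reverse.
cutoff = 0.6
passing = sorted((c for c in candidates if scores[c] >= cutoff),
                 key=lambda c: (scores[c], c), reverse=True)

print(scores)
print(passing)  # ['mass', 'mask', 'master', 'massive']
```

Four words clear a cutoff of 0.6 here, so a limit of n = 3 would drop the lowest-scoring one.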

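With only the first two arguments, the function uses the default cutoff of 0.6 (on a scale from 0 to 1) and the default limit of 3 results. The following example shows how the function works.

Example: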

# importing the get_close_matches function from the difflib module
from difflib import get_close_matches

# calling the get_close_matches function
my_list = get_close_matches('mas', ['master', 'mask', 'duck', 'cow', 'mass', 'massive', 'python', 'butter'])

# printing the list
print("Matching words:", my_list)

Output:

Matching words: ['mass', 'mask', 'master']

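Explanation:

In the snippet above, we import the get_close_matches function from the difflib module and call it on a list in which several items share characters with the word. The function returns only three words, the best matches ordered from most to least similar, even though a fourth item, massive, is also similar to the word "mas". Now let's set the result limit n and the cutoff explicitly in the following example: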

Example:

# importing the get_close_matches function from the difflib module
from difflib import get_close_matches

# calling the get_close_matches function with n and cutoff
my_list = get_close_matches(
    'mas',
    ['master', 'mask', 'duck', 'cow',
    'mass', 'massive', 'python', 'butter'],
    n = 4,
    cutoff = 0.6
    )

# printing the list
print("Matching words:", my_list)

Output:

Matching words: ['mass', 'mask', 'master', 'massive']

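Explanation:

In the snippet above, we get four results, each at least 60% similar to the word "mas". The cutoff of 0.6 that we passed is the same as the default value; raising n to 4 is what lets massive, whose score is exactly 0.6, into the result. We can change cutoff to make the matching stricter or looser: the closer the value is to 1, the stricter the constraint.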
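As a quick illustration of a stricter setting, the sketch below raises cutoff to 0.8 on the same word list, and also shows what comes back when nothing is similar enough. The word 'xyz' is an arbitrary input chosen for the example.

```python
from difflib import get_close_matches

words = ['master', 'mask', 'duck', 'cow', 'mass', 'massive', 'python', 'butter']

# A stricter cutoff keeps only the closest candidates.
strict = get_close_matches('mas', words, n=4, cutoff=0.8)
print(strict)  # ['mass', 'mask']

# When no candidate reaches the cutoff, the result is an empty list.
none_found = get_close_matches('xyz', words)
print(none_found)  # []
```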

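Understanding the unified_diff and context_diff Functions

difflib has two functions that work in the same way: unified_diff and context_diff. The only major difference between them is the format of the result.

unified_diff takes two lists of lines and returns a generator of delta lines in unified diff format, showing each line that was removed from the first list or added in the second.

Consider the following example for a better understanding.

Example: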

# importing the required modules
import sys
from difflib import unified_diff

# defining the lists of lines
str_1 = ['Mark\n', 'Henry\n', 'Richard\n', 'Stella\n', 'Robin\n', 'Employees\n']
str_2 = ['Arthur\n', 'Joseph\n', 'Stacey\n', 'Harry\n', 'Emma\n', 'Employees\n']

# using the unified_diff() function
sys.stdout.writelines(unified_diff(str_1, str_2))

Output:

--- 
+++ 
@@ -1,6 +1,6 @@
-Mark
-Henry
-Richard
-Stella
-Robin
+Arthur
+Joseph
+Stacey
+Harry
+Emma
 Employees

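Explanation:

In the snippet above, we import the required modules and define two lists of lines. We then call unified_diff() to compare them. The function does not modify either list; it only reports what would have to change to turn the first into the second. In the result, lines removed from the first list are prefixed with - and lines added from the second list are prefixed with +. The last line, 'Employees', is common to both lists, so it carries a single space as its prefix and serves as context. The --- and +++ header lines are empty because we passed no file names.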
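The header lines can be filled in with the fromfile and tofile parameters, as in this minimal sketch. The file names old.txt and new.txt are hypothetical labels; no files are read.

```python
import sys
from difflib import unified_diff

old = ['Mark\n', 'Henry\n', 'Employees\n']
new = ['Mark\n', 'Harry\n', 'Employees\n']

# fromfile/tofile only label the headers; they are not opened.
diff_lines = list(unified_diff(old, new, fromfile='old.txt', tofile='new.txt'))
sys.stdout.writelines(diff_lines)

# Printed diff:
# --- old.txt
# +++ new.txt
# @@ -1,3 +1,3 @@
#  Mark
# -Henry
# +Harry
#  Employees
```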

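context_diff works like unified_diff, but it presents the result in context diff format: the two lists are shown in separate before and after blocks, and lines that changed between them are prefixed with !. Lines that were only deleted or only added are prefixed with - and + respectively.

Let's look at the following example to see how this function works.

Example: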

# importing the required modules
import sys
from difflib import context_diff

# defining the lists of lines
str_1 = ['Mark\n', 'Henry\n', 'Richard\n', 'Stella\n', 'Robin\n', 'Employees\n']
str_2 = ['Arthur\n', 'Joseph\n', 'Stacey\n', 'Harry\n', 'Emma\n', 'Employees\n']

# using the context_diff() function
sys.stdout.writelines(context_diff(str_1, str_2))

Output:

*** 
--- 
***************
*** 1,6 ****
! Mark
! Henry
! Richard
! Stella
! Robin
  Employees
--- 1,6 ----
! Arthur
! Joseph
! Stacey
! Harry
! Emma
  Employees

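Explanation:

In the example above, we use context_diff to compare the two lists. The changed lines are prefixed with '!', shown first as they appear in the first list and then as they appear in the second.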
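For contrast, here is a small sketch in which one line is only deleted and another is only added, so the output uses the - and + prefixes instead of !. The lists and the file names old.txt and new.txt are made up for illustration.

```python
import sys
from difflib import context_diff

old = ['a\n', 'b\n', 'c\n']
new = ['a\n', 'c\n', 'd\n']

# 'b' is deleted and 'd' is inserted; no line is replaced in place.
ctx_lines = list(context_diff(old, new, fromfile='old.txt', tofile='new.txt'))
sys.stdout.writelines(ctx_lines)

# Printed diff:
# *** old.txt
# --- new.txt
# ***************
# *** 1,3 ****
#   a
# - b
#   c
# --- 1,3 ----
#   a
#   c
# + d
```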