Python Program to Count the Number of Unique Words in a Text File

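This article shows how to count the unique words in a text file with Python, using two examples. The first example splits the file's text into a list of words and converts that list into a set, which keeps only the unique words, and then counts them. The second example builds the same word list, sorts it, removes the duplicates from the sorted list, and counts the words that remain.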

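Preprocessing Algorithm

Step 1 - Sign in with a Google account. Go to Google Colab, open a new Colab notebook, and write the Python code in it.

Step 2 - Upload the txt file "file1.txt" to Google Colab.

Step 3 - Open the txt file for reading.

Step 4 - Convert the file's text to lowercase.

Step 5 - Use the split function to separate the text into words.

Step 6 - Print the list named 'words_in_file', which contains the words from the text file.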
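
Steps 3 to 5 can be gathered into one small helper. This is only a minimal sketch: the function name read_words and the utf-8 encoding are assumptions made for illustration and do not appear in the examples of this article.

```python
def read_words(path):
    """Return the lowercase, whitespace-separated tokens of a text file."""
    # The with statement closes the file even if reading fails.
    with open(path, "r", encoding="utf-8") as file:
        text = file.read().lower()
    # split() with no arguments splits on any run of whitespace,
    # so punctuation stays attached to the neighbouring word.
    return text.split()
```

Calling read_words("file1.txt") would then return the same kind of list that the examples store in words_in_file.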

Text File Used in These Examples

The content of file1.txt is as follows:

This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now.

Uploading file1.txt to Colab

Figure: Uploading file1.txt in Google Colab
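
If the code is run outside Colab, the sample file can also be created directly instead of being uploaded. The sketch below is one possible way to do that, assuming the file should be written to the current working directory; the constant name SAMPLE_TEXT is hypothetical.

```python
from pathlib import Path

# The seven lines of the sample file used in this article.
SAMPLE_TEXT = """This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now.
"""

# Write the sample content to file1.txt in the current working directory.
Path("file1.txt").write_text(SAMPLE_TEXT, encoding="utf-8")
```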

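Method 1: Counting the Unique Words in a Text File Using a Python Set

After the preprocessing steps, Method 1 proceeds as follows.

Step 1 - Start from the list 'words_in_file' produced by the preprocessing steps.

Step 2 - Convert this list to a set. A set contains only unique words.

Step 3 - Use a print statement to display the set of unique words.

Step 4 - Find the length of the set.

Step 5 - Print the length of the set.

Step 6 - This value is the number of unique words in the text file.

Example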

# Open the text file and convert its content to lowercase.
# The with statement closes the file once it has been read.
with open("file1.txt", 'r') as file:
    thegiventxtfile = file.read().lower()

# Split the text into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe unique words given in this txt file are :\n")

# Convert the list to a Python set, which keeps only unique words.
# A set is unordered, so the printed order may vary between runs.
uniqueWords = set(words_in_file)

print(uniqueWords)

# Find the number of words in the set
numberofuniquewords = len(uniqueWords)

print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The unique words given in this txt file are :

{'there', 'only.', 'testing', 'new', 'is', 'for', 'oh!', 'this', 'a', 'made', 'seven', 'are', 'purposes', 'in', 'file.', 'four', 'now.', 'no..', 'lines'}

The number of unique words given in this txt file are :

19
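
In the output above, tokens such as 'file.' and 'no..' keep their punctuation, because split() only separates on whitespace. In a file that contained both "file" and "file.", the two would be counted as different words. A minimal sketch of one way to avoid this, assuming that leading and trailing punctuation should be ignored, is shown below; the function name unique_words_without_punctuation is hypothetical and is not part of the examples in this article.

```python
import string

def unique_words_without_punctuation(words):
    """Return the set of words with leading/trailing punctuation removed."""
    cleaned = (word.strip(string.punctuation) for word in words)
    # Drop tokens that were nothing but punctuation.
    return {word for word in cleaned if word}

sample = ["file.", "file", "oh!", "no..", "now."]
print(sorted(unique_words_without_punctuation(sample)))
# prints ['file', 'no', 'now', 'oh']
```

Note that strip() only removes characters at the two ends of a token, so punctuation inside a word, such as the apostrophe in "don't", is left in place.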

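Method 2: Counting the Unique Words in a Text File Using a Python Dictionary

Step 1 - Open the required file and build the list 'words_in_file' as in the preprocessing steps.

Step 2 - Sort the list and print it. In the alphabetically sorted list, duplicate words appear next to each other.

Step 3 - To remove the duplicate words and keep only the unique ones, use dict.fromkeys(words_in_file).

Step 4 - Convert the result back to a list.

Step 5 - Print the list of unique words.

Step 6 - Compute the length of the final list and display it. This value is the number of unique words in the text file.

Example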

# Open the text file in read mode and convert its content to lowercase.
# The with statement closes the file once it has been read.
with open("file1.txt", 'r') as file:
    thegiventxtfile = file.read().lower()

# Split the text into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe sorted words list from this txt file is :\n")

# Sort the list of words in place
words_in_file.sort()

print(words_in_file)
print("\nThe sorted words list after removing duplicates from this txt file is :\n")

# Remove the duplicate words: dictionary keys are unique
myuniquewordlist = list(dict.fromkeys(words_in_file))

# Count the number of words left
numberofuniquewords = len(myuniquewordlist)

print(myuniquewordlist)
print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The sorted words list from this txt file is :

['a', 'are', 'are', 'are', 'are', 'are', 'file.', 'file.', 'file.', 'file.', 'file.', 'for', 'four', 'four', 'four', 'four', 'in', 'in', 'in', 'in', 'is', 'is', 'lines', 'lines', 'lines', 'lines', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'there', 'there', 'there', 'there', 'this', 'this', 'this', 'this', 'this', 'this']

The sorted words list after removing duplicates from this txt file is :

['a', 'are', 'file.', 'for', 'four', 'in', 'is', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'this']

The number of unique words given in this txt file are :

19
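
The sorting step in Method 2 is there to make the duplicates visible; it is not needed for the removal itself. Since Python 3.7, dictionaries keep insertion order, so dict.fromkeys applied to an unsorted list keeps each word at the position of its first occurrence. The short sketch below illustrates this on a hypothetical list made of the first few words of the sample file.

```python
words = ["this", "is", "a", "new", "file.", "this", "is", "made"]

# dict keys are unique, and since Python 3.7 dicts keep insertion order,
# so each word stays at the position of its first occurrence.
unique_in_order = list(dict.fromkeys(words))

print(unique_in_order)
# prints ['this', 'is', 'a', 'new', 'file.', 'made']
print(len(unique_in_order))
# prints 6
```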

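Conclusion

This article showed two different ways to find the unique words in a given txt file. First, the txt file is uploaded to a Colab notebook and opened for reading. Its content is then split into words, which are stored in a list. Both examples in this article use that list of words.

Example 1 uses a Python set. The list may contain duplicate words, but when it is converted to a set only the unique words are kept, and the len() function then gives the number of unique words. In Example 2, the list of words is first sorted so that duplicate words appear next to each other and are easy to see. dict.fromkeys(words_in_file) then removes the duplicates, and the length of the resulting list gives the number of unique words.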