Classifying Text Documents with Naive Bayes in Python

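The Naive Bayes algorithm is a powerful tool for sorting documents, or the words in a text, into categories. For example, if a document contains words such as "humid", "rain", or "cloudy", Naive Bayes can be used to decide whether the document belongs to a category such as "sunny day" or "rainy day".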

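Note that Naive Bayes assumes the words in a document are independent of one another once the document's category is known. Given the nuances of language, this assumption rarely holds, which is why the algorithm has "naive" in its name. In practice it still performs well enough.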
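The independence assumption is what lets the classifier multiply per-word probabilities. As a sketch of the rule the example in this article applies (the symbols $c$, $w_i$, $N_c$ and $|V|$ are notation introduced here for illustration): $c$ is a category, $w_1, \dots, w_n$ are the words of the text to classify, $N_c$ is the total number of words in the training documents of category $c$, and $|V|$ is the number of unique keywords.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{\operatorname{count}(w, c) + 1}{N_c + |V|}
```

The "+1" in the numerator is add-one (Laplace) smoothing, which keeps a word that never occurs in a category from forcing the whole product to zero.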

Steps

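Step 1 - Enter the number of documents, the text strings, and their categories. Split each text into keywords and store them in lists, then enter the string to classify.
Step 2 - Build a list that holds the frequency of every keyword in each document, and print it as a table with the prettytable library. Name the column headers as needed.
Step 3 - Count the total number of words and documents in each category (positive and negative).
Step 4 - Compute the probability of each word of the test string for the positive category, rounded to four decimal places.
Step 5 - Compute the probability of the positive category with Bayes' formula, rounded to eight decimal places.
Step 6 - Repeat the previous two steps for the negative category.
Step 7 - Compare the results for the two categories and print the outcome.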
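The steps above can be sketched compactly as a single function. This is a minimal illustration, not the article's script: the function name `classify` and its return format are assumptions made here, table printing is left out, and the intermediate rounding is skipped, so the scores differ slightly from rounded values.

```python
from collections import Counter

def classify(docs, text):
    """docs: list of (sentence, label) pairs; text: string to classify."""
    word_counts = {}          # label -> Counter of word frequencies
    doc_counts = Counter()    # label -> number of documents
    for sentence, label in docs:
        word_counts.setdefault(label, Counter()).update(sentence.split())
        doc_counts[label] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}

    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = doc_counts[label] / len(docs)       # prior P(class)
        for word in text.split():
            # add-one (Laplace) smoothing for each word of the test text
            score *= (counts[word] + 1) / (total_words + len(vocabulary))
        scores[label] = score
    # Return the label with the highest score, plus all scores
    return max(scores, key=scores.get), scores

docs = [("they love laugh and pray", "Positive"),
        ("without faith you suffer", "Negative")]
label, scores = classify(docs, "suffer without love laugh and pray")
print(label, scores)
```

Using `Counter` means a word missing from a category simply counts as zero, so no explicit membership check is needed.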

Example

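To keep this example simple and easy to follow, we use only two documents of one sentence each and run Naive Bayes classification on a string that resembles both sentences. Each document has a category, and the goal is to decide which category the test string belongs to.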

#Step 1 - Input the required data and split the text and keywords
total_documents = 2
text_list = ["they love laugh and pray", "without faith you suffer"]
category_list = ["Positive", "Negative"]
doc_class = []
keywords = []
# Store each document as [list of words, category] and collect every word seen
for text, category in zip(text_list, category_list):
   words = text.split()
   doc_class.append([words, category])
   keywords.extend(words)
# Keep each keyword once, in alphabetical order
keywords = sorted(set(keywords))
# Split the test string into words so the later loops iterate over words, not characters
to_find = "suffer without love laugh and pray".split()

#step 2 - make frequency table for keywords and print the table
probability_table = []
for i in range(total_documents):
   probability_table.append([])
   for j in keywords:
      probability_table[i].append(0)
# Count how often each keyword occurs in each document
for i in range(total_documents):
   for k in range(len(keywords)):
      if keywords[k] in doc_class[i][0]:
         probability_table[i][k] += doc_class[i][0].count(keywords[k])
print('\n')
import prettytable
keywords.insert(0, 'Document Number')
keywords.append("Class/Category")
Prob_Table = prettytable.PrettyTable()
Prob_Table.field_names = keywords
Prob_Table.title = 'Probability table'
x=0
for i in probability_table:
   i.insert(0,x+1)
   i.append(doc_class[x][1])
   Prob_Table.add_row(i)
   x=x+1
print(Prob_Table)
print('\n')
for i in probability_table:
   i.pop(0)

#step 3 - count the words and documents based on categories    
totalpluswords=0
totalnegwords=0
totalplus=0
totalneg=0
vocabulary=len(keywords)-2
for i in probability_table:
   if i[-1]=="Positive":  # the last column of each row holds the class label
      totalplus+=1
      totalpluswords+=sum(i[0:len(i)-1])
   else:
      totalneg+=1
      totalnegwords+=sum(i[0:len(i)-1])
keywords.pop(0)
keywords.pop(len(keywords)-1)

#step - 4 Find probability of each word for positive class
temp=[]
for i in to_find:
   count=0
   x=keywords.index(i)
   for j in probability_table:
      if j[len(j)-1]=="Positive":
         count=count+j[x]
   temp.append(count)
   count=0
for i in range(len(temp)):
   temp[i]=format((temp[i]+1)/(vocabulary+totalpluswords),".4f")
print()
temp=[float(f) for f in temp]
print("Probabilities of each word in the 'Positive' category are: ")
h=0
for i in to_find:
   print(f"P({i}/+) = {temp[h]}")
   h=h+1
print()

#step - 5 Find probability of class using Bayes formula
prob_pos=float(format((totalplus)/(totalplus+totalneg),".8f"))
for i in temp:
   prob_pos=prob_pos*i
prob_pos=format(prob_pos,".8f")
print("Probability of text in 'Positive' class is :",prob_pos)
print()

#step - 6 Repeat above two steps for the negative class
temp=[]
for i in to_find:
   count=0
   x=keywords.index(i)
   for j in probability_table:
      if j[len(j)-1]=="Negative":
         count=count+j[x]
   temp.append(count)
   count=0
for i in range(len(temp)):
   temp[i]=format((temp[i]+1)/(vocabulary+totalnegwords),".4f")
print()
temp=[float(f) for f in temp]
print("Probabilities of each word in the 'Negative' category are: ")
h=0
for i in to_find:
   print(f"P({i}/-) = {temp[h]}")
   h=h+1
print()
prob_neg=float(format((totalneg)/(totalplus+totalneg),".8f"))
for i in temp:
   prob_neg=prob_neg*i
prob_neg=format(prob_neg,".8f")
print("Probability of text in 'Negative' class is :",prob_neg)
print('\n')

#step - 7 Compare the probabilities and print the result 
# prob_pos and prob_neg are formatted strings, so compare their numeric values
if float(prob_pos)>float(prob_neg):
   print(f"By Naive Bayes Classification, we can conclude that the given belongs to 'Positive' class with the probability {prob_pos}")
else:
   print(f"By Naive Bayes Classification, we can conclude that the given belongs to 'Negative' class with the probability {prob_neg}")
print('\n')

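We iterate over the documents and collect their keywords in a separate list. A second pass records how often each keyword occurs in each document, and the result is printed as a frequency table. The code then counts the words and documents in the positive and negative categories and determines the vocabulary size, that is, the number of unique keywords.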
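For readers without prettytable installed, the same per-document frequency table can be sketched with the standard library alone. The name `doc_counts` and the plain-text layout are assumptions of this sketch, not part of the script above.

```python
from collections import Counter

text_list = ["they love laugh and pray", "without faith you suffer"]
category_list = ["Positive", "Negative"]

# One Counter per document: word -> number of occurrences
doc_counts = [Counter(text.split()) for text in text_list]
# Every distinct word across all documents, in alphabetical order
keywords = sorted(set().union(*doc_counts))

# Plain-text table: one row per document, one column per keyword
print("doc", *keywords, "class")
for number, (counts, category) in enumerate(zip(doc_counts, category_list), start=1):
    print(number, *(counts[w] for w in keywords), category)
```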

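We then compute the probability of each keyword in the positive category: for every word of the input text we count its occurrences in the positive documents, apply add-one smoothing, and store the resulting probabilities in a new list. Multiplying these probabilities by the prior of the positive category, following Bayes' formula, gives the score for the input text belonging to that category. The same calculation is repeated for the negative category. Finally, we compare the two results and pick the category with the higher value.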
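The two values being compared are unnormalized scores (prior times word likelihoods), so they do not sum to one. If a posterior probability is wanted, one option is to divide each score by their sum. The sketch below assumes the counts of the two example sentences (five positive words, four negative words, nine unique keywords) and uses exact fractions instead of the rounded values; the variable names are illustrative.

```python
from fractions import Fraction

vocabulary = 9                                   # unique words in both sentences
class_words = {"Positive": 5, "Negative": 4}     # total words per category
# Occurrences of suffer, without, love, laugh, and, pray in each category
matches = {"Positive": [0, 0, 1, 1, 1, 1],
           "Negative": [1, 1, 0, 0, 0, 0]}
prior = Fraction(1, 2)                           # one document per category

scores = {}
for label, counts in matches.items():
    score = prior
    for count in counts:
        # add-one smoothed probability of this word in this category
        score *= Fraction(count + 1, class_words[label] + vocabulary)
    scores[label] = score

# Normalize so the two posteriors sum to one
total = sum(scores.values())
posterior = {label: float(score / total) for label, score in scores.items()}
print(posterior)
```

Normalizing never changes which category wins; it only makes the result easier to read as a probability.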

Output


The script first prints the keyword frequency table. With the class labels matched as "Positive" and "Negative" in step 3, the rest of the output is:

Probabilities of each word in the 'Positive' category are: 
P(suffer/+) = 0.0714
P(without/+) = 0.0714
P(love/+) = 0.1429
P(laugh/+) = 0.1429
P(and/+) = 0.1429
P(pray/+) = 0.1429

Probability of text in 'Positive' class is : 0.00000106

Probabilities of each word in the 'Negative' category are: 
P(suffer/-) = 0.1538
P(without/-) = 0.1538
P(love/-) = 0.0769
P(laugh/-) = 0.0769
P(and/-) = 0.0769
P(pray/-) = 0.0769

Probability of text in 'Negative' class is : 0.00000041

By Naive Bayes Classification, we can conclude that the given belongs to 'Positive' class with the probability 0.00000106
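Even for a six-word test string the class scores need eight decimal places to show anything, and for longer texts a product of small probabilities can underflow to zero. A common workaround, sketched here as a variation and not as part of the script above, is to add logarithms instead of multiplying. The helper name `log_score` is an assumption of this sketch, and the word probabilities are the add-one smoothed values for the two example sentences.

```python
import math

def log_score(prior, word_probs):
    """Log of prior * product(word_probs), computed as a sum of logs."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Smoothed word probabilities for "suffer without love laugh and pray"
pos = log_score(0.5, [1/14, 1/14, 2/14, 2/14, 2/14, 2/14])
neg = log_score(0.5, [2/13, 2/13, 1/13, 1/13, 1/13, 1/13])
print("Positive" if pos > neg else "Negative")

# For a long text the plain product underflows, the log sum does not
product = 0.5
for _ in range(500):
    product *= 1/14
long_log = log_score(0.5, [1/14] * 500)
print(product, long_log)
```

Because the logarithm is monotonic, comparing log scores picks the same category as comparing the original products.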