Python: Train a Tokenizer and Filter Stopwords in a Sentence
In natural language processing, splitting text into sentences is a key preprocessing task. Sentence tokenization is the process of breaking a text corpus down into individual sentences. NLTK's default sentence tokenizer works well on ordinary prose, but it may fail when the text contains non-standard characters such as unusual punctuation or symbols. In such cases we need to train a tokenizer of our own.
In this article, let us explore how to train a tokenizer and look at how filter words, or stopwords, are used.
Sentence Tokenization in Natural Language Processing
The default tokenizer in NLTK can be applied to the sample text given below.
Ram – Where did you go last Sunday?
Mohan – I went to see the Taj Mahal.
Ram – Where is the Taj Mahal located?
Mohan – It is located in Agra. It is considered to be one of the wonders of the world.
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# sample conversation that does not follow a normal paragraph structure
textual_data = """
Ram : Where have gone last Sunday?
Mohan : I went to see the Taj Mahal.
Ram: Where is the Taj Mahal located
Mohan: It is located in Agra.It is considered to be one of the wonders of the world.
"""

# split the text into sentences with NLTK's default sentence tokenizer
sentences = sent_tokenize(textual_data)
print(sentences[0])
print("\n", sentences)
Output
Ram : Where have gone last Sunday?
['\nRam : Where have gone last Sunday?', 'Mohan : I went to see the Taj Mahal.', 'Ram: Where is the Taj Mahal located \n\n\nMohan: It is located in Agra.It is considered to be one of the wonders of the world.']
The output above does not look correct: the tokenizer could not split this text properly because it does not follow a normal paragraph structure.
This is a situation in which a tokenizer needs to be trained.
Link to data.txt: https://drive.google.com/file/d/1bs2eBbSxTSeaAuDlpoDqGB89Ej9HAqPz/view?usp=sharing
Training a Tokenizer
In this example, we will use the Punkt sentence tokenizer.
import nltk
nltk.download('webtext')
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# read the raw text of the uploaded data file (Colab path used here)
data = webtext.raw('/content/data.txt')

# train a Punkt sentence tokenizer on the text itself
tokenizer_sentence = PunktSentenceTokenizer(data)

# split the text into sentences with the trained tokenizer
sentences = tokenizer_sentence.tokenize(data)
print(sentences[0])
print("\n", sentences)
Output
Ram : Where have gone last Sunday?
['Ram : Where have gone last Sunday?', 'Mohan : I went to see the Taj Mahal.', 'Ram: Where is the Taj Mahal located?', 'Mohan: It is located in Agra.It is considered to be one of the wonders of the world.']
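Once trained, the same PunktSentenceTokenizer instance can be reused on other text written in the same style. A minimal sketch, using a made-up sample string for illustration:

# reuse the trained tokenizer on new, similarly formatted text (hypothetical sample)
new_text = "Ram: How was the trip? Mohan: It was wonderful."
print(tokenizer_sentence.tokenize(new_text))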
Filtering Stopwords in a Sentence
In a text corpus, words that do not add any specific meaning to a sentence are called stopwords. Since they are not critical for natural language processing tasks, they are usually removed from the raw text during preprocessing. The NLTK library ships with stopword collections for several different languages.
Let us walk through the stopword filtering process with a code example.
Example sentence: "A new actor is born every generation and is worshipped by many fans"
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize

sentence = "A new actor is born every generation and is worshipped by many fans"

# load the English stopword list as a set for fast lookup
stopwords_en = set(sw.words('english'))

# split the sentence into words and drop the stopwords
word_list = word_tokenize(sentence)
filtered_words = [w for w in word_list if w not in stopwords_en]

print("Words present in the sentence initially : ", word_list)
print("\nWords after stopword removal process : ", filtered_words)
Output
Words present in the sentence initially :  ['A', 'new', 'actor', 'is', 'born', 'every', 'generation', 'and', 'is', 'worshipped', 'by', 'many', 'fans']
Words after stopword removal process : ['A', 'new', 'actor', 'born', 'every', 'generation', 'worshipped', 'many', 'fans']
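Note that 'A' survives the filter because NLTK's English stopword list is all lowercase ('a' is in it, but 'A' is not). A minimal sketch for case-insensitive filtering is to lowercase each token before the membership test:

# lowercase each token before checking it against the stopword list
filtered_words_ci = [w for w in word_list if w.lower() not in stopwords_en]
print(filtered_words_ci)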
Different Types of Tokenization
TF-IDF Tokenization
TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a scheme that uses how often words occur to determine and weigh the importance of a word in a particular document. It is made up of two terms: TF (Term Frequency) and IDF (Inverse Document Frequency).
TF measures how frequently a term t appears in a particular document d, and is calculated as:
TF = (frequency of t in d) / (total number of terms in d) = tf(t, d)
IDF measures how rare the term t is across the collection of documents D; the more documents contain t, the lower its IDF. It is commonly calculated as:
IDF = log(total number of documents N in the corpus / number of documents containing the term t) = idf(t, D)
The TF-IDF score is the product of the two terms TF and IDF:
TF-IDF = tf(t, d) * idf(t, D)
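To make the formulas concrete, here is a minimal hand-computed sketch for one term of the small corpus used in the scikit-learn example below. Note that scikit-learn's TfidfVectorizer applies smoothing and normalization on top of this plain textbook form, so its weights will differ:

import math

# toy corpus of three documents (same as in the scikit-learn example below)
corpus = ["Hello how are you", "You have called me", "How are you"]
docs = [doc.lower().split() for doc in corpus]

term, d = "hello", docs[0]             # score the term "hello" in the first document
tf = d.count(term) / len(d)            # tf(t, d): 1 occurrence out of 4 terms
df = sum(term in doc for doc in docs)  # number of documents containing t
idf = math.log(len(docs) / df)         # idf(t, D), textbook form without smoothing
print(tf * idf)                        # tf-idf weight of "hello" in document 0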
Example of a TF-IDF vectorizer using the Scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ["Hello how are you", "You have called me", "How are you"]

# learn the vocabulary and compute the TF-IDF matrix for the corpus
data = vectorizer.fit_transform(corpus)

# the tokens (features) extracted from the corpus
tokens = vectorizer.get_feature_names_out()
print(tokens)
print(data.shape)
Output
['are' 'called' 'have' 'hello' 'how' 'me' 'you']
(3, 7)
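The fit_transform call returns a sparse matrix; to inspect the actual TF-IDF weights, it can be converted to a dense array:

# view the TF-IDF weights as a dense (documents x tokens) array
print(data.toarray())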
Frequency Count
This approach counts how often each word occurs in a document or text of the corpus.
For example, in the given text:
The fox was walking in the jungle. It then saw a tiger coming towards it. The fox was terrified on seeing the tiger.
the following word frequencies can be obtained (only words longer than three characters are shown):
coming: 1
it.The: 1
jungle: 1
seeing: 1
terrified: 1
then: 1
tiger: 2
towards: 1
walking: 1
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text_data = "The fox was walking in the jungle. It then saw a tiger coming towards it.The fox was terrified on seeing the tiger."

# split the text into word tokens
words = word_tokenize(text_data)
print(words)

# count how often each token occurs
word_freq = FreqDist(words)

# keep only the tokens longer than three characters
words_filtered = dict([(i, j) for i, j in word_freq.items() if len(i) > 3])
for k in sorted(words_filtered):
    print("%s: %s" % (k, words_filtered[k]))
Output
['The', 'fox', 'was', 'walking', 'in', 'the', 'jungle', '.', 'It', 'then', 'saw', 'a', 'tiger', 'coming', 'towards', 'it.The', 'fox', 'was', 'terrified', 'on', 'seeing', 'the', 'tiger', '.']
coming: 1
it.The: 1
jungle: 1
seeing: 1
terrified: 1
then: 1
tiger: 2
towards: 1
walking: 1
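Since FreqDist behaves like a counter, the most frequent tokens can also be listed directly; for example:

# show the three most frequent tokens and their counts
print(word_freq.most_common(3))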
Rule-Based Tokenization
A rule-based tokenizer breaks text into words according to predefined rules. The rules can be regular-expression filters or grammatical constraints.
For example, a rule can be used to split the text on whitespace or commas.
In addition, some tokenizers are designed for tweets, with special rules that split words while preserving special characters such as emoticons (see the sketch after the example below).
Below is a code example of a tokenizer based on regular-expression rules.
from nltk.tokenize import regexp_tokenize

data_text = "Jack and Jill went up the hill."

# split the text on runs of word characters and apostrophes
print(regexp_tokenize(data_text, r"[\w']+"))
Output
['Jack', 'and', 'Jill', 'went', 'up', 'the', 'hill']
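For the tweet-specific rules mentioned above, NLTK provides TweetTokenizer. A minimal sketch, with a made-up sample tweet for illustration:

from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps emoticons and hashtags intact instead of splitting them apart
tweet = "Loved the Taj Mahal!!! :-) #travel"
print(TweetTokenizer().tokenize(tweet))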
Stopword Filter
Stopwords are common words that, in the context of natural language processing and text processing, do not contribute to the meaning of a sentence and are therefore usually ignored by most tokenization pipelines.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text_data = "Hello how are you.You have called me. How are you"

# tokenize the text and drop the English stopwords
data = word_tokenize(text_data)
filtered_words = [w for w in data if w not in stopwords.words('english')]
print(filtered_words)
Output
['Hello', '.', 'You', 'called', '.', 'How']
Conclusion
Sentence tokenization and stopword removal are very common and important NLP text preprocessing steps. For text with a simple corpus structure, the default sentence tokenizer can be used, whereas for text that does not follow the usual paragraph structure a tokenizer can even be trained. Stopwords do not contribute to the meaning of a sentence, so they are filtered out during text preprocessing.