Training a Unigram Tagger in Python for Natural Language Processing

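A single word is called a unigram. A unigram tagger is a tagger that needs only one word to infer that word's part of speech: its entire context is the word itself. The NLTK library provides UnigramTagger, which inherits from NgramTagger.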

In this article, we walk through how a unigram tagger is trained.

Training a Unigram Tagger with NLTK

How It Works

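UnigramTagger inherits from NgramTagger, which is itself a subclass of ContextTagger. It implements the context() method, which takes the same arguments as choose_tag().
The context() method returns a single word token as the context. That word is then used to look up the best tag.
During training, UnigramTagger builds a context model that maps each word seen in the training data to a tag.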
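
To make the idea concrete, here is a minimal pure-Python sketch of a word-to-tag lookup model. It is not NLTK's implementation: the function names train_unigram_model and tag_sentence and the toy training data are hypothetical, and the sketch assumes the model simply keeps the most frequent tag seen for each word.

```python
from collections import Counter, defaultdict


def train_unigram_model(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word][tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag_sentence(model, tokens):
    """Tag each token using only the token itself; unseen words get None."""
    return [(tok, model.get(tok)) for tok in tokens]


# Hypothetical toy training data, not taken from Treebank.
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("run", "NN"), ("ends", "VBZ")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("they", "PRP"), ("run", "VBP")],
]
toy_model = train_unigram_model(toy_train)
print(tag_sentence(toy_model, ["the", "dog", "run", "quickly"]))
```

In this sketch "run" is tagged VBP because that tag occurs twice in the toy data against one NN, and "quickly" receives None because it never appears in training.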

Python Implementation

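import nltk
# Fetch the Treebank sample corpus (skipped if it is already installed).
nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb

# Train on the first 4000 tagged sentences (fewer if the corpus is smaller).
sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)

# Tag the untagged sentence at index 1.
print("Sample Sentence : ", tb.sents()[1])
print("Tag sample sentence : ", uni_tagger.tag(tb.sents()[1]))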

Output

Sample Sentence :  ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
Tag sample sentence :  [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'JJ'), ('publishing', 'NN'), ('group', 'NN'), ('.', '.')]

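In the code above, a unigram tagger is first trained on the first 4000 sentences of Treebank. Once trained, the same tagger can be used to tag any sentence. The example tags the sentence at index 1.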

The following code can be used to test and evaluate the unigram tagger.

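from nltk.corpus import treebank as tb
from nltk.tag import UnigramTagger

sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)
# Evaluate on the tagged sentences from index 3000 onward.
sent_tested = tb.tagged_sents()[3000:]
# evaluate() reports token-level accuracy; newer NLTK releases also offer accuracy().
print("Test score : ", uni_tagger.evaluate(sent_tested))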

Output

Test score :  0.96

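In the code above, the unigram tagger is trained on the first 4000 sentences and then evaluated on the sentences from index 3000 onward. Because these two slices overlap, part of the test data was already seen during training, so the score overstates how well the tagger handles unseen text.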
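
The sketch below shows what such a score measures, assuming it is plain token-level accuracy: the share of tokens whose predicted tag matches the gold tag. The function name tagging_accuracy, the lookup table, and the held-out sentences are all hypothetical stand-ins, not NLTK objects or Treebank data.

```python
def tagging_accuracy(model, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        for word, gold_tag in sent:
            total += 1
            if model.get(word) == gold_tag:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical word -> tag lookup standing in for a trained unigram tagger.
toy_model = {"the": "DT", "dog": "NN", "run": "VBP", "barks": "VBZ"}

# Hypothetical held-out sentences, kept separate from any training data.
held_out = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("run", "NN"), ("ends", "VBZ")],
]
print(tagging_accuracy(toy_model, held_out))
```

Here four of the six tokens are tagged correctly: "run" is mistagged because the lookup ignores the surrounding words, and "ends" is unknown to the model.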

Smoothing Techniques

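In NLP we often need to build statistical models, for example to predict the next word from training data or to auto-complete a sentence. With so many possible word combinations, accurate word prediction is essential, and this is where smoothing helps. Smoothing adjusts the probabilities in a trained model so that it predicts words more accurately and can assign a sensible probability even to words that do not occur in the training corpus.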

Types of Smoothing

Laplace Smoothing

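Also known as add-one smoothing: we add 1 to every count in the numerator and add N to the denominator, so that no probability is 0 and no division by zero occurs. For example,

Prob_laplace(w_i | w_(i-1)) = (count(w_(i-1) w_i) + 1) / (count(w_(i-1)) + N)
N = number of distinct words (vocabulary size) in the training corpus
Prob("He likes coffee")
= Prob(He | <S>) * Prob(likes | He) * Prob(coffee | likes) * Prob(<E> | coffee)
= ((1+1) / (4+6)) * ((1+1) / (1+8)) * ((0+1) / (1+5)) * ((1+1) / (4+8))
= 0.00123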
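
As a rough illustration of add-one smoothing on bigrams, the sketch below computes smoothed probabilities from a toy corpus. The corpus, the function names, and the choice to add the number of distinct next-token types (words plus the <E> marker) to the denominator are assumptions made for this example; its counts are unrelated to the figures above.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count contexts and bigrams, padding each sentence with <S> and <E>."""
    context_counts, pair_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<S>"] + sent + ["<E>"]
        vocab.update(tokens[1:])  # every type that can follow a context
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return context_counts, pair_counts, vocab


def laplace_prob(prev, cur, context_counts, pair_counts, vocab_size):
    """Add-one estimate of P(cur | prev)."""
    return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)


def sentence_prob(sent, context_counts, pair_counts, vocab_size):
    """Product of the smoothed bigram probabilities over the padded sentence."""
    tokens = ["<S>"] + sent + ["<E>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= laplace_prob(prev, cur, context_counts, pair_counts, vocab_size)
    return prob


# Hypothetical toy corpus used only for this illustration.
corpus = [
    ["he", "likes", "tea"],
    ["she", "likes", "coffee"],
    ["he", "drinks", "coffee"],
]
ctx, pairs, vocab = bigram_counts(corpus)
print(sentence_prob(["he", "likes", "coffee"], ctx, pairs, len(vocab)))
# The bigram "he tea" never occurs, yet its smoothed probability is non-zero.
print(laplace_prob("he", "tea", ctx, pairs, len(vocab)))
```

With this choice of denominator, the smoothed probabilities for any fixed context sum to 1 over the possible next tokens, which is a quick sanity check on the implementation.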