Python: Find the Sentence in a File Most Similar to an Input Sentence

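Natural language processing (NLP) lets computers interpret and analyze human language. Finding the word or sentence most similar to a given input sentence is a common NLP problem. Python offers several ways to find similar sentences.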

Requirements

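To complete this task, you need the nltk library installed on your system. The example code also imports scikit-learn for the TF-IDF step, so install it as well. Run the following command in your system terminal or command prompt (not inside the Python interpreter).

pip install nltk scikit-learn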

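If the command above fails, check from the Windows Command Prompt that Python and pip are available, then run the installation through the Python interpreter.

python --version
pip --version
python -m pip install nltk scikit-learn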

Once the installation succeeds, we can import nltk in our code and use its modules to write a sentence-finding program.
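Before going further, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. The helper name `check_modules` is hypothetical, and `sklearn` is listed because the example code in this article imports scikit-learn under that name.

```python
import importlib.util

# Import names (not pip package names): scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["nltk", "sklearn"]

def check_modules(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_modules(REQUIRED_MODULES)
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If a module is reported as missing, rerun the pip command for that package in the same environment that runs your script.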

Example

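We will create a Python program that takes an input sentence and finds the most similar sentence in a file. Let's see how to do this with Python's NLTK package. Specifically, we will combine several NLP preprocessing steps with the TF-IDF (term frequency-inverse document frequency) method, which the example code computes with scikit-learn's TfidfVectorizer.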
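To make the TF-IDF idea concrete, here is a small hand-written sketch. It assumes one common smoothed variant, weight = count × (ln((1 + N) / (1 + df)) + 1), where N is the number of documents and df is the number of documents containing the term. The function name `tfidf_weights` and the toy documents are illustrative only, and the sketch skips the length normalization that a library may apply afterwards.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per document (documents are token lists)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        })
    return weights

# Toy documents, already tokenized (assumed data for illustration).
docs = [["comedy", "movie"], ["horror", "movie"], ["hello", "girl"]]
for doc_weights in tfidf_weights(docs):
    print(doc_weights)
```

A term such as "movie" that appears in several documents receives a lower weight than a term such as "comedy" that appears in only one, which is why rare shared words dominate the similarity score.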

Steps

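Step 1: Install and import NLTK. You can use either of the installation methods above.

Step 2: Write code to load the sentences from the file. Read the lines, then build a list of sentences with the leading and trailing whitespace stripped from each one.

Step 3: Preprocess both the input sentence and the stripped sentences from the file.

Step 4: Tokenize each sentence, breaking it into words.

Step 5: Remove stop words from each sentence and lemmatize the remaining tokens, so that only the main words are compared.

Step 6: Convert the preprocessed sentences to TF-IDF vectors and compare the input sentence's vector with the vector of each sentence in the file. The sentence with the highest similarity score is the most similar sentence in the file.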
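The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the NLTK-based program: a regular expression stands in for NLTK's tokenizer, `STOP_WORDS` is a tiny hand-written list, lemmatization is omitted, and raw term counts stand in for TF-IDF weights. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"i", "am", "a", "an", "the", "is", "this"}

def preprocess(sentence):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query, candidates):
    """Return the candidate with the highest cosine similarity to the query."""
    query_tokens = preprocess(query)
    return max(candidates, key=lambda s: cosine_similarity(query_tokens, preprocess(s)))

# Step 2: strip surrounding whitespace from the raw lines (assumed sample data).
raw_lines = ["  This is a comedy movie.  ", "I am watching a horror movie.\n"]
stripped = [line.strip() for line in raw_lines]

# Steps 3 to 6: preprocess, compare, and pick the best match.
best = most_similar("a scary horror movie", stripped)
print(best)
```

Swapping the term counts for TF-IDF weights changes how much each shared word contributes, but the comparison step stays the same.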

示例

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load the tokenizer data from this resource
nltk.download('wordnet')

# Load the file containing sentences, skipping blank lines
def load_sentences(file_path):
   with open(file_path, 'r', encoding='utf-8') as file:
      lines = file.readlines()
   return [line.strip() for line in lines if line.strip()]

# Preprocess the input sentence
def preprocess_sentence(sentence):
   # Tokenize
   tokens = word_tokenize(sentence.lower())

   # Remove stopwords
   stop_words = set(stopwords.words('english'))
   tokens = [token for token in tokens if token not in stop_words]

   # Lemmatize
   lemmatizer = WordNetLemmatizer()
   tokens = [lemmatizer.lemmatize(token) for token in tokens]

   return ' '.join(tokens)

# Get the most similar sentence
def get_most_similar_sentence(user_input, sentences):
   # Preprocess input sentence
   preprocessed_user_input = preprocess_sentence(user_input)

   # Preprocess sentences
   preprocessed_sentences = [preprocess_sentence(sentence) for sentence in sentences]

   # Create TF-IDF vectorizer
   vectorizer = TfidfVectorizer()

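   # Generate TF-IDF matrix (row 0 is the input, the remaining rows are the file sentences)
   tfidf_matrix = vectorizer.fit_transform([preprocessed_user_input] + preprocessed_sentences)

   # Calculate similarity scores between the input (row 0) and every file sentence.
   # TfidfVectorizer L2-normalizes each row by default, so the dot product is the cosine similarity.
   # toarray() replaces the .A shortcut, which newer SciPy releases no longer provide.
   similarity_scores = (tfidf_matrix @ tfidf_matrix.T).toarray()[0][1:]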

   # Find the index of the most similar sentence
   most_similar_index = similarity_scores.argmax()
   most_similar_sentence = sentences[most_similar_index]

   return most_similar_sentence

# Main program
def main():
   file_path = 'sentences.txt'  # Path to the file containing sentences
   sentences = load_sentences(file_path)

   user_input = 'hello I am a woman'

   most_similar_sentence = get_most_similar_sentence(user_input, sentences)
   print('Most similar sentence:', most_similar_sentence)

if __name__ == '__main__':
   main()
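Note that `argmax` returns the first index when several scores tie, and it still returns index 0 when every score is zero, so the program reports a sentence even if nothing in the file shares a word with the input. The sketch below shows one possible way to handle this by ranking the scores and dropping non-matches. The helper `rank_sentences` and its parameters are hypothetical, and the scores in the usage example are made-up values, not output of the program.

```python
def rank_sentences(scores, sentences, top_k=3, min_score=0.0):
    """Pair each sentence with its score and return the best top_k, highest first.

    Sentences whose score is not above min_score are dropped, so an input
    that shares no terms with the file yields an empty list.
    """
    paired = [(score, sentence) for score, sentence in zip(scores, sentences)
              if score > min_score]
    # list.sort() is stable, so equal scores keep their original file order.
    paired.sort(key=lambda pair: pair[0], reverse=True)
    return paired[:top_k]

# Made-up scores for illustration only.
example_sentences = ["This is a comedy movie.", "This is a horror movie.",
                     "hello I am a girl.", "hello I am a boy."]
example_scores = [0.0, 0.0, 0.41, 0.41]

ranking = rank_sentences(example_scores, example_sentences, top_k=2)
for score, sentence in ranking:
    print(f"{score:.2f}  {sentence}")
```

Inside `get_most_similar_sentence`, such a helper could be called with `similarity_scores` and `sentences` in place of the `argmax` lookup.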

Contents of the text file: sentences.txt

This is a comedy movie.

This is a horror movie.

hello I am a girl.

hello I am a boy.
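To try the program, the sample file has to exist first. The sketch below writes the four sample sentences (in English, since the program uses English stop words) to a temporary `sentences.txt` and reads them back while skipping blank lines. The helper names `write_sample_file` and `read_sentences` are hypothetical, and the temporary directory is used only so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

SAMPLE_LINES = [
    "This is a comedy movie.",
    "This is a horror movie.",
    "hello I am a girl.",
    "hello I am a boy.",
]

def write_sample_file(path):
    """Write the sample sentences to path, separated by blank lines."""
    Path(path).write_text("\n\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")

def read_sentences(path):
    """Read the non-empty lines from path with surrounding whitespace stripped."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sentences.txt"
    write_sample_file(sample_path)
    loaded = read_sentences(sample_path)

print(loaded)
```

For a real run, write the file next to your script instead, so that the relative path `sentences.txt` used in `main()` resolves correctly.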