Developing a Text Search Engine with the Whoosh Library in Python

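Whoosh is a Python library of classes and functions for indexing text and searching those indexes. If you are building an application that has to go through many documents and find similarities or extract data based on predefined conditions, or you want to count how many times a project title is mentioned across research papers, what we build in this tutorial should be useful to you.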

Getting Started

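To build our text search engine, we will use the Whoosh library.

The library does not ship with Python, so we will download and install it with the pip package manager.

To install Whoosh, use the following command: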

pip install whoosh

Now we can import it into our script with the following code.

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

Building a Text Search Engine with Python

First, let's create a folder in which the index files will be stored.

import os
# exist_ok=True avoids a FileExistsError when the script is run a second time
os.makedirs("dir", exist_ok=True)

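Next, let's define a schema. The schema specifies the fields of the documents in the index. We then create the index in the folder and add one document to it.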

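# title and content are analyzed TEXT fields; path is an ID field indexed as a single term
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
# create a new index in the "dir" folder
ind = index.create_in("dir", schema)
writer = ind.writer()
writer.add_document(title="doc", content="Py doc hello big world", path="/a")
# commit() saves the added document to the index
writer.commit()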

Now that the document is indexed, we can search it.

from whoosh.qparser import QueryParser

with ind.searcher() as searcher:
    # parse the query string against the "content" field
    query = QueryParser("content", ind.schema).parse("hello world")
    # terms=True records which query terms matched
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
        if results.has_matched_terms():
            print(results.matched_terms())

Output

It produces the following output:

<Hit {'path': '/a', 'title': 'doc', 'content': 'Py doc hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}

Example

Here is the complete code:

import os

from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# create the index folder; exist_ok=True avoids an error on a second run
os.makedirs("dir", exist_ok=True)

# define the schema, create the index, and add one document
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)
writer = ind.writer()
writer.add_document(title="doc", content="Py doc hello big world", path="/a")
writer.commit()

# search the "content" field and print each hit, its score, and the matched terms
with ind.searcher() as searcher:
    query = QueryParser("content", ind.schema).parse("hello world")
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
        if results.has_matched_terms():
            print(results.matched_terms())