How to Read PDF Content in Python Using OCR (Optical Character Recognition) | 极客笔记


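Python is one of the most popular programming languages in the world today. We can use it to analyze data, but data is not always available in the format we need. In such cases, we can convert the file from a format such as PDF or JPG to plain text (.txt) so that the data is easier to analyze. Many libraries are available for this kind of task.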

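We can use Python's PyPDF2 module to convert a .pdf file to text. The main drawback of this module is the encoding scheme: PDF documents can contain a variety of encodings, such as Unicode, ASCII, and UTF-8, so converting a PDF file to text may lose data when the encoding schemes are inconsistent. In addition, PyPDF2 reads only the text embedded in the file, so a scanned page that consists of an image yields no text at all.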
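For comparison, direct extraction of the embedded text can be sketched as follows. This is a minimal sketch, assuming PyPDF2 version 2 or later, where `PdfReader` exposes `pages` and each page has an `extract_text()` method. The helper names `extract_text_layer` and `needs_ocr` are hypothetical, and treating an empty result as "needs OCR" is a simple heuristic rather than an established rule.

```python
def extract_text_layer(reader):
    """Return the embedded text of each page of a PyPDF2-style reader.

    `reader` is assumed to expose `.pages`, and each page `.extract_text()`,
    as PyPDF2's PdfReader does (PyPDF2 2.x or later).
    """
    texts = []
    for page in reader.pages:
        # extract_text() may give None or "" when a page has no text layer
        texts.append(page.extract_text() or "")
    return texts


def needs_ocr(texts):
    """Heuristic (an assumption): 1-based numbers of pages with no text."""
    return [i for i, text in enumerate(texts, start=1) if not text.strip()]


# Usage with a real file (requires PyPDF2 to be installed):
#     from PyPDF2 import PdfReader
#     texts = extract_text_layer(PdfReader("exp.pdf"))
#     print(needs_ocr(texts))
```

Pages reported by `needs_ocr` are the ones for which the OCR approach described in this tutorial is likely to be necessary.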

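In this tutorial, we will learn how to read the contents of a PDF file using optical character recognition (OCR) and save them to a text (.txt) file.

First, we convert the pages of the PDF document into images; then we use OCR to read the content of each image and save it to a text (.txt) file.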

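Required modules

We need to install the modules required for this tutorial with the following commands. The leading ! runs the command from a notebook cell; omit it in a regular terminal.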

PIL (distributed on PyPI as Pillow):

!pip3 install Pillow

pytesseract:

!pip3 install pytesseract

pdf2image:

!pip3 install pdf2image

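tesseract-ocr:

!pip3 install tesseract-ocr

(For this package, the user should have Microsoft Visual C++ 14.0, which is available through "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/ )

Note that pytesseract is only a wrapper: it calls the Tesseract OCR program, which must be installed on the system and reachable on the PATH. Likewise, pdf2image relies on the Poppler utilities to render PDF pages.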
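Because pytesseract and pdf2image both call external programs (the Tesseract executable and the Poppler utilities, respectively), it can save time to check the setup before running the main script. The following is a minimal sketch using only the standard library; the function name `check_ocr_dependencies` is hypothetical, and the check assumes the programs are installed under their usual names `tesseract` and `pdftoppm`.

```python
import importlib.util
import shutil


def check_ocr_dependencies():
    """Report which pieces of the OCR toolchain can be found.

    Python packages are looked up by import name; the external programs
    are looked up on PATH. `pdftoppm` is one of the Poppler utilities
    that pdf2image calls to render pages.
    """
    return {
        "PIL": importlib.util.find_spec("PIL") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        "tesseract (program)": shutil.which("tesseract") is not None,
        "pdftoppm (program)": shutil.which("pdftoppm") is not None,
    }


if __name__ == "__main__":
    for name, found in check_ocr_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any entry reported as missing should be installed before continuing; a program installed outside the PATH will also show up as missing here.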

Part 1:

The first part converts the pages of the PDF into image files. Each page of the PDF is stored as a separate image file, named as follows:

PDF page no. 1: page_no_1.jpg
PDF page no. 2: page_no_2.jpg
PDF page no. 3: page_no_3.jpg
PDF page no. 4: page_no_4.jpg
.
.
PDF page no. n: page_no_n.jpg
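The naming scheme above can be sketched as a pair of small helpers. This is an illustrative sketch, not the tutorial's own code: the names `page_filename` and `save_pages` and the `directory` parameter are assumptions, and `pages` is assumed to be a sequence of PIL images such as the list returned by `pdf2image.convert_from_path`.

```python
import os


def page_filename(page_number, directory="."):
    """Build the image path for a 1-based page number, e.g. page_no_3.jpg."""
    return os.path.join(directory, f"page_no_{page_number}.jpg")


def save_pages(pages, directory="."):
    """Save each page image as a JPEG and return the list of paths written.

    `pages` is assumed to be a sequence of PIL images, such as the list
    returned by pdf2image.convert_from_path.
    """
    os.makedirs(directory, exist_ok=True)
    paths = []
    for number, page in enumerate(pages, start=1):
        path = page_filename(number, directory)
        page.save(path, "JPEG")
        paths.append(path)
    return paths
```

Returning the list of paths means the second part does not have to rebuild the filenames from a counter; it can simply iterate over what was written.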

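Part 2:

The second part recognizes the text in the image files and collects it in a text file in .txt format. Here we process each image file and convert it into text. Once we have the text as a string variable, we can process it before writing it to the text (.txt) file. In many PDF files, when the last word of a line does not fit on that line, a hyphen is added at the end and the word continues on the next line. For example: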

This is an example to show the above explanation of the wo-
rd which cannot be written entirely in the same line and is conti-
nued in the next line.

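For such words, we do some basic preprocessing that joins the fragment before the hyphen and its continuation on the next line into a complete word. Once preprocessing is done, the text is stored in a separate text file.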
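The preprocessing step can be tried on the example above in isolation. The sketch below is one possible way to do it; the function name `join_hyphenated_words` is hypothetical. It does the same job as replacing the hyphen-plus-newline sequence with an empty string, and additionally tolerates trailing spaces and Windows-style line endings.

```python
import re


def join_hyphenated_words(text):
    # Remove a hyphen that ends a line, together with the line break,
    # so that the two halves of the word are joined again.
    return re.sub(r"-[ \t]*\r?\n", "", text)


sample = (
    "This is an example to show the above explanation of the wo-\n"
    "rd which cannot be written entirely in the same line and is conti-\n"
    "nued in the next line.\n"
)
print(join_hyphenated_words(sample))
```

One limitation of this simple rule is that a genuine compound broken at its hyphen (for example "well-" at the end of a line followed by "known") is also joined, giving "wellknown".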

Code

from PIL import Image
import pytesseract as PT
from pdf2image import convert_from_path as CFP

'''
Part #1 - Convert each page of the PDF file into an image file
'''
# Path of the PDF file to read
PDF_file_1 = "exp.pdf"

# Render every page of the PDF as an image. The second argument is the
# resolution in DPI; 300 is an assumed value chosen so that the text stays
# legible for OCR (a very low resolution gives unusable images).
pages_1 = CFP(PDF_file_1, 300)

# Counter used to number the image file of each page
image_counter1 = 1

# Iterating through all the pages of the PDF file stored above
for page in pages_1:

    # Declare the filename for each page of the PDF file as a JPG file.
    # For each page, the filename will be:
    # PDF page no. 1: page_no_1.jpg
    # PDF page no. 2: page_no_2.jpg
    # PDF page no. 3: page_no_3.jpg
    # PDF page no. 4: page_no_4.jpg
    # .... and so on..
    # PDF page no. n: page_no_n.jpg
    filename1 = "page_no_" + str(image_counter1) + ".jpg"

    # Save the image of the page to disk
    page.save(filename1, 'JPEG')

    # Increase the counter so the next page gets a new filename
    image_counter1 = image_counter1 + 1

'''
Part #2 - Recognize the text content from the image files by using OCR
'''
# Total number of pages
filelimit1 = image_counter1 - 1

# Name of the text file that receives the output
out_file1 = "output_text.txt"

# Open the output file in append mode so that the contents of all the
# images are added to the same output file
f_1 = open(out_file1, "a", encoding="utf-8")

# Iterating from 1 to the total number of pages
for K in range(1, filelimit1 + 1):

    # Set the filename of the image to recognize text from.
    # Again, these files will be:
    # page_no_1.jpg
    # page_no_2.jpg
    # page_no_3.jpg
    # ....
    # page_no_n.jpg
    filename1 = "page_no_" + str(K) + ".jpg"

    # Recognize the text in the image file as a string by using the pytesseract module
    text1 = PT.image_to_string(Image.open(filename1))

    # The recognized text is now stored in the variable text1.
    # Any string processing may be applied to it.
    # Here, basic formatting is done: words split across lines by a hyphen are joined again.
    text1 = text1.replace('-\n', '')

    # At last, write the processed text into the file
    f_1.write(text1)

# Closing the file after writing all the text content
f_1.close()
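The intermediate JPEG files are not strictly required, because pytesseract can read the PIL images produced by pdf2image directly. A variant that keeps everything in memory might look like the sketch below. It is an illustration under stated assumptions, not part of the original listing: the names `ocr_pages` and `pdf_to_text`, the `recognize` parameter, and the default of 300 DPI are all hypothetical choices.

```python
def ocr_pages(pages, recognize=None):
    """Run OCR over page images and return the cleaned, combined text.

    `pages` is an iterable of images (for example the PIL images from
    pdf2image.convert_from_path). `recognize` is a callable mapping an
    image to a string; by default pytesseract.image_to_string is used.
    """
    if recognize is None:
        # Imported lazily so the function can be exercised without Tesseract
        import pytesseract
        recognize = pytesseract.image_to_string

    chunks = []
    for page in pages:
        text = recognize(page)
        # Same basic preprocessing as above: rejoin words split by a hyphen
        chunks.append(text.replace("-\n", ""))
    return "".join(chunks)


def pdf_to_text(pdf_path, out_path, dpi=300):
    """Convert a PDF to a text file without intermediate JPEG files."""
    from pdf2image import convert_from_path  # needs Poppler installed

    text = ocr_pages(convert_from_path(pdf_path, dpi=dpi))
    # "w" overwrites the file, so repeated runs do not pile up duplicates
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

A call such as `pdf_to_text("exp.pdf", "output_text.txt")` would then perform both parts in one step. Passing `recognize` explicitly makes it possible to try the text handling with a stand-in function when Tesseract is not installed.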

Output:

Input PDF file:
