How to read simple text from a PDF file with Python?

I need to parse a PDF file to extract just the first few lines of text. I have looked at several Python packages for the job, but without any luck.

I have tried:

  • PDFminer, PDFminer.six and PDFminer3k, which appear overly complex for such a simple job; I was unable to find a simple working example

  • slate, which gave an error during installation; it installed with the fix from a thread, but then gave an error when I tried to use it. Maybe I am using the wrong PDFminer, but I can't figure out which one to use

  • PyPDF2 and PyPDF3, but these gave garbage as described here

  • tika, which gave various terminal error messages and was very slow

  • pdftotext, which failed to install

  • pdf2text, which failed at "import pdf2text"; when changed to "pdftotext", the import failed with "ImportError: cannot import name 'Extractor'" even though pip list shows that "Extractor" is installed

Usually I find that installed Python packages work amazingly well, but parsing PDF to text appears to be a jungle, as the myriad of tools also suggests.

Any suggestions on how to do simple parsing of a PDF file to text in Python?

PyPDF2 example added

An example of PyPDF2 is:

import PyPDF2

# Open in binary mode; the file must stay open while the page is read
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    first_page = pdf_reader.getPage(0)
    print(first_page.extractText())

This returns garbage such as:

$%$%&%&$'(' ˜!)"*+#

EquipDev asked Jan 24 '20 10:01


2 Answers

Based on pdfminer, I was able to extract the bare essentials from the pdf2txt.py script (provided with pdfminer) into a function:

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    """Return the text of all pages of the PDF at `path` as one string."""
    with open(path, 'rb') as fp:
        rsrcmgr = PDFResourceManager()
        outfp = io.StringIO()  # in-memory buffer that collects the text
        laparams = LAParams()  # default layout-analysis parameters
        device = TextConverter(rsrcmgr, outfp, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Render each page through the converter, which writes to outfp
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    text = outfp.getvalue()
    return text
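
Since the question only asks for the first few lines, the string returned by pdf_to_text can be trimmed afterwards. The following is a minimal sketch, not part of pdfminer: the helper name first_lines and its count parameter are assumptions made for illustration. It skips blank lines, which this kind of extraction often produces between text blocks.

```python
from itertools import islice

def first_lines(text, count=5):
    """Return up to `count` non-empty lines of `text`, stripped of surrounding whitespace."""
    # Lazily walk the lines so that only as many as needed are collected
    non_empty = (line.strip() for line in text.splitlines() if line.strip())
    return list(islice(non_empty, count))
```

It would be used as first_lines(pdf_to_text('file.pdf'), 3). For large documents it may also be worth limiting how many pages are processed; PDFPage.get_pages appears to accept a maxpages argument for that, but check this against the pdfminer version you have installed.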
EquipDev answered Oct 19 '22 11:10


@EquipDev your solution actually works quite nicely for me, though the output is tab-delimited rather than space-delimited. I would make one change to the last line:

return text.replace('\t', ' ')  # replace tabs with spaces
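
If several tabs or spaces appear in a row, a one-for-one replacement leaves runs of spaces behind. A possible variation, sketched here with a hypothetical helper name (normalize_whitespace is not from pdfminer), collapses each run into a single space while leaving line breaks alone:

```python
import re

def normalize_whitespace(text):
    """Collapse each run of spaces and tabs into one space, keeping line breaks."""
    return re.sub(r'[ \t]+', ' ', text)
```

With this, the last line of pdf_to_text would become return normalize_whitespace(text).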

jb4earth answered Oct 19 '22 12:10