I need to parse a PDF file to extract just the first few lines of text. I have looked at several Python packages for the job, but without any luck.
I have tried:
PDFminer, PDFminer.six and PDFminer3k, which appear overly complex for such a simple job; I was unable to find a simple working example
slate, which failed to install; it installed after applying the fix from a thread, but then raised an error when used. I may be using the wrong PDFminer, but can't figure out which one to use
PyPDF2 and PyPDF3, but these gave garbage as described here
tika, which gave various error messages in the terminal and was very slow
pdftotext, which failed to install
pdf2text, which failed at "import pdf2text"; when changed to "pdftotext", the import failed with "ImportError: cannot import name 'Extractor'", even though pip list shows that "Extractor" is installed
Usually I find that installed Python packages work amazingly well, but parsing PDF to text appears to be a jungle, which the myriad of tools also indicates.
Any suggestions on how to do simple parsing of a PDF file to text in Python?
PyPDF2 example added
An example of PyPDF2 is:
import PyPDF2
pdfFileObj = open('file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj_0 = pdfReader.getPage(0)
print(pageObj_0.extractText())
Which returns garbage such as:
$%$%&%&$'(' ˜!)"*+#
Based on pdfminer, I was able to extract the bare necessities from the pdf2txt.py script (provided with pdfminer) into a function:
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    with open(path, 'rb') as fp:
        rsrcmgr = PDFResourceManager()
        outfp = io.StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, outfp, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    text = outfp.getvalue()
    return text
@EquipDev your solution actually works quite nicely for me, though the output is tab-delimited rather than space-delimited. I would make one change to the last line:
return text.replace('\t', ' ')  # replace tabs with spaces
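Since the question only asks for the first few lines, the string returned by pdf_to_text can be trimmed afterwards. The following is a minimal sketch of that step, not part of pdfminer: the helper name first_lines and its parameter n are made up for illustration, and it assumes the text has already been extracted. It also folds in the tab replacement suggested above.

```python
from itertools import islice

def first_lines(text, n=5):
    """Return the first n non-empty lines of text (hypothetical helper)."""
    # Replace tabs with spaces and trim surrounding whitespace on each line
    cleaned = (line.replace('\t', ' ').strip() for line in text.splitlines())
    # Skip blank lines and stop as soon as n lines have been collected
    return list(islice((line for line in cleaned if line), n))

# Stand-in for the output of pdf_to_text('file.pdf')
sample = "Title\tof paper\n\n  Author Name \nAbstract\nBody"
print(first_lines(sample, 3))  # ['Title of paper', 'Author Name', 'Abstract']
```

This still parses the whole document before trimming, so on large files it saves nothing in run time; it only tidies up the result.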