I am trying to convert a very clean PDF file into a .txt file using Python. I have tried both PyPDF2 and PDFMiner, and both recognised the text perfectly.
However, because the lines are wrapped in the PDF, the extracted .txt file has an unintended line break at the end of each line, e.g. line 1: "is an account of the Elder \n Days, ". There should not be a line break between "Elder" and "Days".
The PDF file:
When the file is edited in Acrobat, it is clear that the original text in the PDF contains no hard line breaks, and it can be edited as a paragraph rather than as single lines.
The code I have tried (adapted from an answer here: "convert from pdf to text: lines and words are broken"):
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

# Converts a PDF and returns its text content as a string
def convert(fname, pages=None):
    # An empty set means "all pages"
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = io.StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

# Raw strings keep the backslashes in Windows paths literal
path = r'D:\Folder\File.pdf'
a = convert(path)
f = open(r"D:\Folder\File.txt", 'a', encoding='utf-8')
f.write(a)
f.close()
"A picture is worth a thousand words" and comments do not allow pictures! I am using the Web archive of a different copy but the Gist is exactly the same.
You are working with "justified" content, but unlike reflowing justification in a word processor, the glyphs in a PDF are generally tied to a line at a set position up from the page base. Adobe are working on reflowable PDFs and have the expertise to combine lines into a paragraph; other readers, however, will accept each line for what it is.
There are no paragraph boundary markers, as there are in, say, HTML.
Readers could in the future be augmented, like Acrobat, to combine the lines, but that is not needed for reading (aloud) one line at a time. Some audio readers will noticeably stutter on those "line at a time" returns, whilst others are intelligently programmed to simply ignore them.
In short, you need to add your own AI/regex to gather lines and add indents, but beware of significant differences between languages and literary conventions, such as hyphenation and East Asian punctuation.
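As a rough illustration of the regex route, here is a minimal sketch for post-processing the extracted string. It rests on assumptions that may not hold for your file: that blank lines in the extracted text mark paragraph boundaries, and that a hyphen at the end of a line between two lowercase letters is a soft (wrap) hyphen. The function name `unwrap_lines` is made up for this example and is not part of PDFMiner or PyPDF2.

```python
import re

def unwrap_lines(text: str) -> str:
    """Join soft-wrapped lines into paragraphs (heuristic sketch)."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    result = []
    for para in paragraphs:
        # Assumption: "exam-\nple" is one word split by wrapping, so drop
        # the hyphen and the break. Genuine hyphenated compounds that
        # happen to fall at a line end will be joined incorrectly.
        para = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", para)
        # Any remaining single line break becomes one space.
        para = re.sub(r"[ \t]*\n[ \t]*", " ", para)
        para = para.strip()
        if para:
            result.append(para)
    return "\n\n".join(result)
```

With the question's code, this could be applied as `a = unwrap_lines(convert(path))` before writing the file. Expect to tune the rules per document: headings, lists, and verse rely on their line breaks, and text without spaces between words needs a different joining rule.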