Python pdfminer LAParams mixes text output

I have a PDF file and I want to extract text from it with pdfminer. The problem is that LAParams sometimes fails and puts part of a line at the end of the output. I can't figure out why.

My PDF looks like this: pdf

The output looks like this: output

My code is below. Thanks in advance:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    # Shared resource cache and an in-memory buffer that collects the text
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    pagenos = set()  # an empty set means "process every page"

    # The context manager closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt('C:\\Users\\Vagos\\Desktop\\europe.pdf'))

People also ask

What is LAParams in Pdfminer?

laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well. Returns a string containing all of the text extracted.


1 Answer

Found the answer myself. LAParams() has a char_margin parameter, which defaults to 2.0 in current pdfminer releases. Some gaps within a line of my document were apparently wider than that, and that caused the problem. Replacing LAParams() with LAParams(char_margin=20) solved the issue. There are other parameters as well; see http://nullege.com/codes/search/pdfminer.layout.LAParams
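
To see why a larger char_margin helps, here is a toy sketch of the grouping rule as the pdfminer documentation describes it: two characters on the same line are joined when the horizontal gap between them is smaller than char_margin times the character width. This is a simplified illustration, not pdfminer's implementation. Char, place() and group_into_lines() are hypothetical helpers made up for this example, and the coordinates are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    text: str
    x0: float  # left edge of the glyph
    x1: float  # right edge of the glyph


def place(words: List[Tuple[str, float]], char_width: float = 5.0) -> List[Char]:
    """Lay out each word starting at the given x, one fixed-width glyph per letter."""
    chars = []
    for word, x in words:
        for i, ch in enumerate(word):
            left = x + i * char_width
            chars.append(Char(ch, left, left + char_width))
    return chars


def group_into_lines(chars: List[Char], char_margin: float,
                     word_margin: float = 0.1) -> List[str]:
    """Join neighbouring characters into runs (simplified, single row only).

    A gap smaller than char_margin * width keeps both characters in one run;
    within a run, a gap larger than word_margin * width becomes a space.
    """
    if not chars:
        return []
    runs = [chars[0].text]
    for prev, cur in zip(chars, chars[1:]):
        gap = cur.x0 - prev.x1
        width = max(prev.x1 - prev.x0, cur.x1 - cur.x0)
        if gap < char_margin * width:
            if gap > word_margin * width:
                runs[-1] += " "
            runs[-1] += cur.text
        else:
            # Gap too wide: start a separate run, which a layout analyser
            # may later place somewhere else in the output.
            runs.append(cur.text)
    return runs


# One table-like row: a name, then a number 40 units (8 glyph widths) away.
row = place([("Paris", 0.0), ("2148", 65.0)])

default_runs = group_into_lines(row, char_margin=2.0)
wide_runs = group_into_lines(row, char_margin=20.0)

print(default_runs)  # ['Paris', '2148']
print(wide_runs)     # ['Paris 2148']
```

With the default margin the row is split into two pieces, and once a line is split the pieces can be emitted in a different order, which is presumably what put parts of lines at the end of the output. With char_margin=20 the whole row stays together. The trade-off is that a very large value can also merge text that really belongs to separate columns.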