Python pdfminer LAParams mixes text output

Tags: python, pdfminer

I have a PDF file and I want to extract its text with pdfminer. The problem is that, with LAParams, the layout analysis sometimes fails and puts part of a line at the end of the output. I can't figure out why. My PDF looks like this: (image: pdf). The output looks like this: (image: output). My code is below. Thanks in advance:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    pagenos = set()  # empty set: process every page

    # Feed each page through the interpreter; the text accumulates in retstr.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt('C:\\Users\\Vagos\\Desktop\\europe.pdf'))

asked Dec 09 '17 15:12

Ευάγγελος Γρηγορόπουλος

1 Answer

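Found the answer myself. LAParams() has a default char_margin of 2.0 (the 0.3 I first had in mind is not the default of either char_margin or word_margin; word_margin defaults to 0.1). My document apparently sometimes had bigger gaps between pieces of text on the same line, and that caused the problem. Replacing LAParams() with LAParams(char_margin = 20) solved the issue. There are other parameters as well; see http://nullege.com/codes/search/pdfminer.layout.LAParams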
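To illustrate why a larger char_margin helps, here is a minimal sketch of the grouping rule as I understand it: two characters end up on the same line when the horizontal gap between them is smaller than char_margin times the character width. This is a simplified illustration and not pdfminer's actual implementation; the function name group_into_lines and the coordinates are made up for the example.

```python
from typing import List, Tuple

# (x0, x1, text): left edge, right edge, and the character itself.
Char = Tuple[float, float, str]


def group_into_lines(chars: List[Char], char_margin: float) -> List[str]:
    """Join neighbouring characters whose gap is below char_margin * width.

    Hypothetical helper: a simplified model of the char_margin rule only.
    It ignores vertical overlap, word_margin and everything else that the
    real layout analysis takes into account.
    """
    lines: List[str] = []
    current = ""
    prev = None
    for char in sorted(chars):
        x0, x1, text = char
        if prev is None:
            current = text
        else:
            gap = x0 - prev[1]
            widest = max(x1 - x0, prev[1] - prev[0])
            if gap < widest * char_margin:
                current += text          # close enough: same line
            else:
                lines.append(current)    # too far apart: start a new line
                current = text
        prev = char
    if current:
        lines.append(current)
    return lines


# Two adjacent characters, then one far to the right on the same row
# (made-up coordinates; each character is 10 units wide).
row = [(0.0, 10.0, "E"), (10.0, 20.0, "U"), (150.0, 160.0, "7")]

print(group_into_lines(row, char_margin=2.0))   # gap 130 is not below 10 * 2
print(group_into_lines(row, char_margin=20.0))  # gap 130 is below 10 * 20
```

With the smaller margin the far-right character is split off into its own line, which is the kind of fragment that can then turn up elsewhere in the output. With char_margin = 20 it stays attached to the rest of the row.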

answered Oct 02 '22 00:10

Ευάγγελος Γρηγορόπουλος
