I am parsing a PDF document with the pdfminer Python module. I just want to extract the text from the document.
The process works well overall, but when I extract LTText*
objects, I notice I am not getting all the text inside each LTText*
object. It behaves as if there were an internal buffer or something like that, because the text gets cut off on every page.
My code:
...
for lt_text_obj in lt_objs:
    if isinstance(lt_text_obj, (LTTextBox, LTTextLine)):
        for text_obj in lt_text_obj._objs:
            if isinstance(text_obj, (LTTextBox, LTTextLine)):
                text_content.append(text_obj)
...
The text_obj variable never contains the entire text, even though the text on each page of the PDF is always formatted the same way.
I don't think the problem is in my code, because I also converted the PDF file to txt using the pdf2txt.py script, and the pages of the resulting txt file are 'cut' as well.
So the problem may lie in the pdfminer configuration or in my PDF file's format... I am completely lost.
Any ideas?
Hard to tell without the input PDF. I'd try running:
pdf2txt.py -o output.xml path/to/your_input.pdf
This tool is part of pdfminer and can be very useful for debugging. Try examining the resulting XML to find the pattern that is not extracted correctly.
Is it possible for you to use PyPDF2 instead? I wrote a small "interface" for myself to transfer pages one by one from one PDF file to another (https://github.com/stianhotboi/pypdf2Interface/blob/master/pypdf2_interface.py). I did not see any problems like yours in my case (everything seemed to transfer correctly).