
Python pdfminer only parses part of the page

I am parsing a PDF document with the Python module pdfminer. I just want to extract the text from this document.

The process works, but when I extract LTText* objects I am not getting all the text inside each object. It looks as if there is an internal buffer or something similar, because the text is cut off on every page.

My code:

...
for lt_text_obj in lt_objs:
    # Only text containers are of interest.
    if isinstance(lt_text_obj, (LTTextBox, LTTextLine)):
        # _objs is the container's internal list of child layout objects.
        if lt_text_obj._objs:
            for text_obj in lt_text_obj._objs:
                if isinstance(text_obj, (LTTextBox, LTTextLine)):
                    text_content.append(text_obj)
...

The text_obj variable never contains the entire text, even though the text on every page of the PDF file is formatted the same way.

I don't think the problem is in my code, because I also converted the PDF file to txt using the pdf2txt.py script, and the pages in the resulting txt file are 'cut' as well.

It seems the problem may be in the pdfminer configuration or in the format of my PDF file... I am completely lost.

Any ideas?

juankysmith asked Nov 07 '13 10:11


2 Answers

It is hard to tell without the input PDF. I'd try running:

pdf2txt.py -o output.xml path/to/your_input.pdf

This tool is part of pdfminer and can be very useful for debugging. Examine the resulting XML to find the pattern that is not extracted correctly.
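
One way to examine the XML is to summarize it page by page. The following is only a minimal sketch and not part of pdfminer: summarize_pages and SAMPLE are hypothetical names, and the code assumes the output uses page, textbox, figure and text elements with one character per text element. Check that against your own output.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed shape of pdf2txt.py XML output.
SAMPLE = """<pages>
<page id="1">
<textbox id="0"><textline><text>H</text><text>i</text><text>
</text></textline></textbox>
</page>
<page id="2">
<figure name="Im0"><text>x</text></figure>
</page>
</pages>"""


def summarize_pages(xml_source):
    """Return one summary dict per <page> element found in the XML string."""
    root = ET.fromstring(xml_source)
    summaries = []
    # iter() searches at any depth, so the name of the root tag does not matter.
    for page in root.iter("page"):
        # Assumption: each <text> element holds a single character.
        chars = [t.text or "" for t in page.iter("text")]
        summaries.append({
            "page": page.get("id"),
            "textboxes": sum(1 for _ in page.iter("textbox")),
            "figures": sum(1 for _ in page.iter("figure")),
            "chars": len("".join(chars)),
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_pages(SAMPLE):
        print(summary)
```

For a real file, pass the contents of output.xml instead of SAMPLE. A page whose character count is much lower than expected, or whose text sits under figure elements rather than textbox elements, may be a good place to start looking.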

Guy Gavriely answered Oct 11 '22 22:10


Is it possible for you to use PyPDF2 instead? I wrote a small "interface" for myself to transfer pages one by one from one PDF file to another (https://github.com/stianhotboi/pypdf2Interface/blob/master/pypdf2_interface.py). I did not see any problems like yours in my case (everything seemed to transfer correctly).

stian answered Oct 11 '22 22:10