I am parsing a PDF document with the pdfminer Python module. I just want to extract the text from the document.
The process works well overall, but when I extract LTText*
objects, I notice I am not getting all the text inside each LTText*
object. It behaves as if there were an internal buffer or something like that, because the text gets cut off on every page.
My code:
...
for lt_text_obj in lt_objs:
    if isinstance(lt_text_obj, (LTTextBox, LTTextLine)):
        for text_obj in lt_text_obj._objs:
            if isinstance(text_obj, (LTTextBox, LTTextLine)):
                text_content.append(text_obj)
...
The text_obj variable never contains the entire text, even though the text on each page of the PDF is always formatted the same way.
I don't think the problem is in my code, because I also converted the PDF file to txt using the pdf2txt.py script, and the pages of the resulting txt file are 'cut' as well.
So the problem may lie in the pdfminer configuration or in my PDF file's format... I am completely lost.
Any ideas?
Hard to tell without the input PDF. I'd try running:
pdf2txt.py -o output.xml path/to/your_input.pdf
This tool is part of pdfminer and can be very useful for debugging. Try examining the resulting XML to find the pattern that is not extracted correctly.
Is it possible for you to use PyPDF2 instead? I wrote a small "interface" for myself to transfer pages one by one from one PDF file to another (https://github.com/stianhotboi/pypdf2Interface/blob/master/pypdf2_interface.py). I did not see any problems like yours in my case (everything seemed to transfer correctly).