I am trying to parse a PDF and build a hierarchical structure from it. Consider the following input:
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
Here is how I extract the outline (the titles):
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a 5-tuple; only the level and title are used here.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print((level, title))
This gives me:
(1, u'Title 1')
(2, u'Title 1.1')
(1, u'Title 2')
This is perfect, as the levels match the text hierarchy. Now I can extract the text as follows:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

# Stop if the document does not allow text extraction.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt', 'w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            # Replace non-ASCII characters with spaces before writing.
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' ' for i in element.get_text()]))
text_from_pdf.close()
This gives me:
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
The order is fine, but I have now lost all sense of hierarchy. How do I know where one title ends and the next begins? And which title, if any, is the parent of a given title/heading?
Is there a way to connect the outline information to the layout elements? It would be great to be able to parse all the information while iterating through the levels.
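To make the goal concrete, here is a minimal sketch of what iterating through the levels could look like. It is only a heuristic, not something pdfminer provides: it assumes every outline title appears verbatim as its own line in the extracted text, in the same order as the outline. The function name build_sections and the dictionary keys are made up for illustration.

```python
def build_sections(outline, lines):
    """Group text lines under outline headings and record each heading's parent.

    outline: list of (level, title) pairs in document order.
    lines:   extracted text lines in reading order.
    Assumption: each title occurs verbatim as its own line, in outline order.
    """
    sections = []              # one dict per heading, in document order
    stack = []                 # currently open headings, outermost first
    remaining = list(outline)  # headings not yet seen in the text
    current = None
    for line in lines:
        text = line.strip()
        if remaining and text == remaining[0][1]:
            level, title = remaining.pop(0)
            # Close every open heading at the same or a deeper level.
            while stack and stack[-1]["level"] >= level:
                stack.pop()
            parent = stack[-1]["title"] if stack else None
            current = {"level": level, "title": title,
                       "parent": parent, "text": []}
            sections.append(current)
            stack.append(current)
        elif current is not None and text:
            # Body text belongs to the most recently seen heading.
            current["text"].append(text)
    return sections


outline = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
lines = [
    "Title 1", "some text some text",
    "Title 1.1", "some more text",
    "Title 2", "some final text",
]
for section in build_sections(outline, lines):
    print(section["level"], section["title"], "parent:", section["parent"])
```

The stack is what answers the "who is the parent" question: a heading's parent is the nearest open heading with a smaller level. The weak point is the exact string match, which fails if a title is wrapped across lines or repeated in the body text.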
Another problem is that if there are any citations at the bottom of a page, then the citation text gets mixed in with the results. Is there a way to ignore the headers, footers and citations when parsing a PDF?
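For the header/footer part, one rough idea is to drop any text box that falls in a band at the top or bottom of the page. The sketch below works on plain (y0, y1, text) tuples so it runs without a PDF; with pdfminer, my understanding is that each layout element exposes its coordinates through bbox (x0, y0, x1, y1, measured from the bottom-left corner) and the page layout has a height. The function name drop_margins and the band sizes are assumptions that would need tuning per document, and citations that sit higher on the page would not be caught.

```python
def drop_margins(boxes, page_height, top=0.08, bottom=0.10):
    """Keep only the text of boxes that lie inside the body area of the page.

    boxes: list of (y0, y1, text) tuples, y measured from the page bottom.
    top, bottom: fractions of the page height treated as header and footer
                 bands. These defaults are guesses, not pdfminer settings.
    """
    lower = page_height * bottom        # everything below is footer
    upper = page_height * (1.0 - top)   # everything above is header
    return [text for (y0, y1, text) in boxes if y0 >= lower and y1 <= upper]


boxes = [
    (760.0, 780.0, "Running header"),
    (400.0, 700.0, "some text some text"),
    (20.0, 60.0, "[1] A citation at the bottom of the page"),
]
body = drop_margins(boxes, page_height=792.0)
print(body)
```

A fixed band is crude; it only works when headers, footers, and footnotes stay in roughly the same region on every page.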
I hope linking the outline to the text is possible, but the pdfminer documentation clearly states the following:
Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there’s no way to tell exactly which part of text these destinations are referring to.
https://pdfminer-docs.readthedocs.io/programming.html#:~:text=Some%20PDF%20documents,are%20referring%20to.
Thanks