I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:
Subtitle1
Body1
Subtitle2
Body2
OR
Subtitle1. Body1
Subtitle2. Body2
I want a tool that can output a list of titles, subtitles, and bodies. Or, if anybody knows an approach for doing this myself, that would also be useful :)
This would be easier if these 3 categories were always in the same format, but the subtitles can sometimes be bold, italic, underlined, or a random combination of the 3. Same for the titles. The problem with simple parsing from HTML/PDF/Docx is that these documents have no standard, so quite often a single sentence is split across several tags (in the case of HTML), making it really hard to parse. Also, the subtitles are not always directly above a given paragraph, and are sometimes in bullet points. So many possible combinations of formatting...
So far I have found similar questions here (using Tesseract) and here (using OpenCV), yet none of them quite answers my question.
I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that does not cut it either. Does anyone know of a package/library, or whether such a thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?
Thank you!
The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10 And say I want to extract Item 7 in a programmatic and structured way, as I mentioned above. But not all of them are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)
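For filings that do render as reasonably clean HTML, a crude fallback is to flatten the page to plain text and slice between the item headings. This is only a sketch using stdlib tools; `extract_item` and the regex patterns are my own naming, and the approach breaks whenever a filing spells the headings differently:

```python
import re
from html import unescape

def extract_item(html, start_pat=r"Item\s+7\.", end_pat=r"Item\s+7A\."):
    """Return the text between two item headings of a 10-K, or None.

    Hypothetical helper: assumes the headings appear literally in the
    rendered text, which is not guaranteed across filings.
    """
    # Flatten markup to plain text and normalise whitespace
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    text = re.sub(r"\s+", " ", text)
    start = re.search(start_pat, text)
    if not start:
        return None
    # Look for the closing heading only after the opening one
    end = re.compile(end_pat).search(text, start.end())
    if not end:
        return None
    return text[start.end():end.start()].strip()
```

Note that a real 10-K usually also lists "Item 7." in its table of contents, so in practice you may need to take the last match rather than the first.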
There are certain tools that can accomplish what you are asking for, up to a certain extent. By "certain extent", I mean that the heading and title font properties will be retained after the OCR conversion.
Take a look at Adobe's Document Cloud platform. It is still pre-release and is expected to launch in early 2020; however, developers can get early access by signing up for the early access program. All the information is available at the following link:
https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html
I have personally tried out the service and the outputs seem promising. Headings and titles are recognised as they appear in the input document. The microservice that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.
Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf
There is a lot of coding to do here, but let me describe what I would do in Python. This assumes the documents have some structure in terms of font size and style:

1. Convert each PDF page to an image (e.g. with pdf2image).
2. Run OCR with pytesseract.image_to_data, which returns every detected word with its bounding box plus block_num, line_num and word_num, so you can reassemble lines and paragraphs.
3. Use the height of each word's bounding box as a proxy for font size.
4. Estimate boldness per word: binarize the word crop, thin it to a one-pixel skeleton with cv2.ximgproc.thinning, and compute (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels. The thicker the stroke, the larger the ratio. (Sometimes there will be a zero division error; check when the sum of the skeleton is 0 and return 0 instead.)
5. Threshold or cluster on font size and boldness to label each line as title, subtitle or body text.

Note that there are ways to detect tables, footers, etc., which I will not dive deeper into. Look for research papers like the ones below.
Relevant research papers: