 

How to do OCR for PDF text extraction WHILE maintaining text structure (header/subtitle/body)

I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:


Title

Subtitle1

Body1

Subtitle2

Body2


OR


Title

Subtitle1. Body1

Subtitle2. Body2


I want a tool that can output a list of titles, subtitles and bodies. Or, if anybody knows how to do this, that would also be useful :)

This would be easier if these three categories shared a single format, but the subtitles can be bold, italic, underlined, or any combination of the three. The same goes for the titles. The problem with simple parsing of HTML/PDF/Docx is that these documents follow no standard, so quite often a single sentence is split across several tags (in the case of HTML), which makes it really hard to parse. On top of that, the subtitles are not always directly above a given paragraph, and are sometimes in bullet points. There are so many possible combinations of formatting...

So far I have encountered similar questions, one here using Tesseract and one here using OpenCV, yet none of them quite answers my question.

I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that also does not cut it. Does anyone know of a package/library, or if such thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?

Thank you!

Edit:

The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10 Say I want to extract Item 7 in a programmatic and structured way, as described above, but not all of these filings are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)
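To illustrate the fragmentation problem, here is a minimal standard-library-only sketch (the class name and the bold heuristic are my own invention for illustration, not a recommended tool). It flags text runs that sit inside bold-ish tags or carry an inline font-weight:bold style; note how a single heading split across tags comes back as several disjoint runs, only some of them marked bold:

```python
from html.parser import HTMLParser

class BoldRunParser(HTMLParser):
    """Collects (text, is_bold) runs; shows how one heading can be
    split across several tags in non-standardized HTML filings."""
    BOLD_TAGS = {"b", "strong", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self._bold = []   # stack of currently open bold-ish tags
        self.runs = []    # (text, is_bold) pairs

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if tag in self.BOLD_TAGS or "font-weight:bold" in style:
            self._bold.append(tag)

    def handle_endtag(self, tag):
        if self._bold and self._bold[-1] == tag:
            self._bold.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((text, bool(self._bold)))

parser = BoldRunParser()
parser.feed('<p><b>Item 7.</b> Management\u2019s '
            '<span style="font-weight: bold">Discussion</span> and Analysis</p>')
# one heading arrives as four separate runs, only two of them bold
```

This is exactly why tag-level parsing alone cannot recover the heading as a unit: some merging heuristic (or the OCR route below) is still needed.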

Daniel Firebanks-Quevedo asked Jul 09 '18

2 Answers

There are certain tools that can accomplish what you are asking for, up to a certain extent. By "up to a certain extent", I mean that the heading and title font properties are retained after the OCR conversion.

Take a look at Adobe's Document Cloud platform. It is still pre-launch and is scheduled to launch in early 2020, but developers can get early access by signing up for the early access program. All the information is available at the following link:

https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html

I have personally tried out the service and the outputs seem promising. All headings and titles are recognised exactly as they appear in the input document. The microservice that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.

Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf

Karthick Mohanraj answered Oct 24 '22


There is a lot of coding to do here, but let me give you a description of what I would do in Python. This relies on the document having some structure in terms of font size and style:

  1. Use the Tesseract OCR software (open source, free) with OEM 1 and PSM 11 in Pytesseract
  2. Preprocess your PDF to an image and apply other relevant preprocessing
  3. Get the output as a dataframe and combine individual words into lines of words by word_num
  4. Compute the thickness of every line of text (by the use of the image and tesseract output)
    • Convert image to grayscale and invert the image colors
    • Perform Zhang-Suen thinning on the selected area of text on the image (opencv contribution: cv2.ximgproc.thinning)
    • Sum where there are white pixels in the thinned image, i.e. where values are equal to 255 (white pixels are letters)
    • Sum where there are white pixels in the inverted image
    • Finally compute the thickness as (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels (guard against division by zero: when the skeleton sum is 0, return 0 instead)
    • Normalize the thickness by minimum and maximum values
  5. Get headers by applying a threshold for when a line of text is bold, e.g. 0.6 or 0.7
  6. To distinguish between a title and a subtitle, you have to rely on either enumerated titles and subtitles or on the font sizes of the titles and subtitles.
    • Calculate the font size of every word by converting height in pixels to height in points
    • The median font size becomes the local font size for every line of text
  7. Finally, you can categorize titles, subtitles, and everything in between can be text.
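Steps 4 and 5 above can be sketched in pure Python/NumPy. The thinning below is a straightforward (and slow) Zhang-Suen implementation standing in as a sketch for cv2.ximgproc.thinning, operating on a binary array where 1 marks a text pixel; the function names are mine, and in practice you would run this per line of text cropped from the Tesseract output:

```python
import numpy as np

def zhang_suen_thin(img):
    """Zhang-Suen thinning of a binary image (1 = text pixel, 0 = background).
    Slow pure-Python stand-in for cv2.ximgproc.thinning."""
    img = img.astype(np.uint8).copy()
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for r in range(1, img.shape[0] - 1):
                for c in range(1, img.shape[1] - 1):
                    if img[r, c] != 1:
                        continue
                    # clockwise neighbours P2..P9, starting from north
                    p = [img[r-1, c], img[r-1, c+1], img[r, c+1],
                         img[r+1, c+1], img[r+1, c], img[r+1, c-1],
                         img[r, c-1], img[r-1, c-1]]
                    b = sum(p)  # number of nonzero neighbours
                    # number of 0 -> 1 transitions in the circular sequence
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:
                        ok = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
                    else:
                        ok = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
                    if 2 <= b <= 6 and a == 1 and ok:
                        to_delete.append((r, c))
            for r, c in to_delete:
                img[r, c] = 0
            changed = changed or bool(to_delete)
    return img

def stroke_thickness(binary_text):
    """Step 4: (ink pixels - skeleton pixels) / skeleton pixels,
    returning 0 when the skeleton is empty to avoid division by zero."""
    skeleton = zhang_suen_thin(binary_text)
    s = int(skeleton.sum())
    if s == 0:
        return 0.0
    return (int(binary_text.sum()) - s) / s
```

On synthetic strokes, a 3-pixel-tall bar scores a clearly higher thickness than a 1-pixel-tall one; after normalizing these scores across the page, a threshold such as the 0.6-0.7 mentioned in step 5 separates bold header lines from body text.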

Note that there are also ways to detect tables, footers, etc., which I will not dive deeper into here. Look for research papers like the ones below.

Relevant research papers:

  • An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. DOI: 10.1007/978-3-642-40501-3_15.
  • Image-based logical document structure recognition. DOI: 10.1007/978-3-642-40501-3_15.
Casper Hansen answered Oct 24 '22