
How to separate title and headers from body text in image

I am using Tesseract (through the Python wrapper) to extract text from documents. These documents do not include any images or tables, only text.

Is there any option to distinguish the titles/headings from the body text? Ideally I would like to get something like an XML tree rather than one flat string (I do not need a visual representation of the document layout).

I found some third-party tools that seem to be able to help, but I was wondering whether I can do it directly from Tesseract.

Prikers asked Jul 13 '18 07:07


2 Answers

You can use the Nanonets OCR API to create your own model that separates headings from body text, or you can add other labels of your own.

vencra answered Oct 06 '22 22:10


I am quite late to answer, but this might help others who are looking for a solution.

First, Tesseract alone won't be able to extract such "features" from a document. All you need is a little understanding of ML and vision libraries (like Luminoth or Detectron2).

Basically, you have to provide some sample documents with mark-up (title, header1, header2, etc.) and train a model on them. After training, you can run the model on unseen images to extract those details.
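If training a model is more than you need, a cruder alternative is a size heuristic on Tesseract's word boxes: lines that are clearly taller than the typical line are probably headings. This is a different technique from the trained approach above and only a minimal sketch. It assumes the input is a dict of parallel lists shaped like what `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns; the function name `classify_lines` and the `ratio` threshold are made up for illustration and are not Tesseract settings.

```python
from collections import defaultdict
from statistics import median


def classify_lines(data, ratio=1.3):
    """Group OCR words into lines and tag unusually tall lines as headings.

    `data` is assumed to hold the parallel lists "text", "block_num",
    "par_num", "line_num" and "height", as in pytesseract's DICT output.
    `ratio` is a hypothetical threshold you would tune per document set.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty entries emitted for non-word levels
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((word, data["height"][i]))

    if not lines:
        return []

    # Use the tallest word box on a line as that line's height.
    heights = {key: max(h for _, h in words) for key, words in lines.items()}
    body_height = median(heights.values())

    result = []
    for key in sorted(lines):
        text = " ".join(w for w, _ in lines[key])
        kind = "heading" if heights[key] > ratio * body_height else "body"
        result.append((kind, text))
    return result
```

To try it on a real page, build `data` with `pytesseract.image_to_data(Image.open("page.png"), output_type=pytesseract.Output.DICT)` (the file name is a placeholder). The resulting `(kind, text)` pairs can then be fed into `xml.etree.ElementTree` if you want a tree. Expect this to fail on documents where headings are bold but not larger, since box height does not capture font weight.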

sohel shaikh answered Oct 06 '22 20:10