How to separate title and headers from body text in image

Question

I am using Tesseract (through the Python wrapper) to extract text from documents. These documents do not include any images or tables, only text.

Is there any option to distinguish the titles/headings from the body text? Ideally I want something like an XML tree rather than one flat string (I do not need a visual of the document layout).

I found some third-party tools that seem to be able to help, but I was wondering if I can do it directly from Tesseract.

vencra · Accepted Answer

You can use the Nanonets OCR API to create your own model that separates headings from body text, or you can add different labels.

sohel shaikh · Answer

I am quite late to answer, but this answer might help others who are looking for a solution.

Firstly, Tesseract alone won't be able to extract such "features" from the document. But all you need is a little understanding of ML and vision libraries (like Luminoth or Detectron2).

Basically, you have to provide some sample documents with mark-up (title, header1, header2, etc.) and train a model on them. After training, you can use the model on unseen images to fetch those details.
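To make the last step concrete, here is a minimal sketch of how the detector output could be combined with Tesseract to get the XML tree asked for in the question. It assumes the trained model gives you a list of `(label, (left, top, right, bottom))` regions with the labels "title", "header" and "body"; that format, the label names, and the function names `tesseract_ocr` and `build_document_tree` are my own for illustration and not something Luminoth or Detectron2 produce as-is. The only Tesseract call used is `pytesseract.image_to_string`.

```python
import xml.etree.ElementTree as ET


def tesseract_ocr(region_image):
    """OCR one cropped region with Tesseract (needs pytesseract and Pillow)."""
    # Imported lazily so the tree-building logic below has no hard dependency.
    import pytesseract
    # --psm 6 asks Tesseract to treat the crop as a single uniform block of text.
    return pytesseract.image_to_string(region_image, config="--psm 6").strip()


def build_document_tree(image, regions, ocr=tesseract_ocr):
    """Build an XML tree from labeled regions of a page image.

    `image` is anything with a PIL-style `crop(box)` method.
    `regions` is a list of (label, (left, top, right, bottom)) tuples, the
    assumed (hypothetical) output format of the layout detector.
    """
    root = ET.Element("document")
    section = root  # body text before the first header hangs off the root
    # Sort by the top coordinate so the tree follows reading order.
    for label, box in sorted(regions, key=lambda region: region[1][1]):
        text = ocr(image.crop(box))
        if label == "title":
            ET.SubElement(root, "title").text = text
        elif label == "header":
            # Each header opens a new section; later body text goes under it.
            section = ET.SubElement(root, "section")
            ET.SubElement(section, "header").text = text
        else:
            ET.SubElement(section, "paragraph").text = text
    return root
```

With a real page you would open it with `PIL.Image.open`, pass the detector's regions, and serialize the result with `ET.tostring(root, encoding="unicode")`. The top-to-bottom sort only makes sense for single-column pages; multi-column layouts would need a smarter reading-order step.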

Tags:

python

opencv

ocr

tesseract

python-tesseract

Prikers
