I am using Tesseract (through the Python wrapper) to extract text from documents. These documents do not include any images or tables, only text.
Is there any option to distinguish the titles/headings from the body text? Ideally I would like something like an XML tree rather than one long string (I do not need a visual representation of the document layout).
I found some third-party tools that seem able to help, but I was wondering whether I can do it directly from Tesseract.
You can use the Nanonets OCR API to create your own model that separates headings from text, or you can add different labels.
I am quite late to answer, but this answer might help others who are looking for a solution.
First, Tesseract alone won't be able to extract such "features" from the document. But all you need is a little understanding of ML and vision libraries (like Luminoth or Detectron2).
Basically, you have to provide some sample documents with mark-up (title, header1, header2, etc.) and train the model. After training, you can use the model on unseen images to extract those details.
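Before training a model, a much cruder heuristic may be worth trying on text-only pages. This is a swapped-in technique, not the trained-model approach described above: Tesseract does not label headings, but pytesseract's image_to_data does report a bounding box per word, so unusually tall lines can be treated as probable headings. The sketch below assumes a single page, assumes headings are set in a noticeably larger font than the body, and uses a hypothetical cut-off (ratio=1.3) that would need tuning. The function names and the XML element names (document, section, title, line) are made up for illustration.

```python
from collections import defaultdict
from statistics import median
import xml.etree.ElementTree as ET


def group_lines(data):
    """Group word boxes from image_to_data (Output.DICT) into text lines."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty entries emitted for blocks, paragraphs, lines
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((word, data["height"][i]))
    # One (text, tallest word height) pair per line, ordered by Tesseract's numbering.
    return [
        (" ".join(w for w, _ in words), max(h for _, h in words))
        for _, words in sorted(lines.items())
    ]


def to_xml(lines, ratio=1.3):
    """Build a small XML tree; lines much taller than the median start a section.

    `ratio` is an assumed threshold, not a Tesseract setting.
    """
    root = ET.Element("document")
    if not lines:
        return root
    body_height = median(h for _, h in lines)
    section = None
    for text, height in lines:
        if height >= ratio * body_height:
            # Probable heading: open a new section with this line as its title.
            section = ET.SubElement(root, "section")
            ET.SubElement(section, "title").text = text
        else:
            # Body text goes under the current section, or the root if none yet.
            parent = section if section is not None else root
            ET.SubElement(parent, "line").text = text
    return root


def ocr_to_xml(image_path):
    """Run Tesseract on one image and return the heuristic XML tree."""
    import pytesseract  # imported here so the helpers above work without it
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return to_xml(group_lines(data))
```

With Tesseract installed, something like ET.tostring(ocr_to_xml("page.png"), encoding="unicode") would give the tree as a string ("page.png" is a placeholder path). Bold headings in the same size as the body text will be missed by this heuristic, which is where a trained layout model becomes the better option.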