
How to preserve document structure in tesseract


I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me. Currently Tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below.

[image: input]

and the output I am getting is as follows:

Someto the left
Some in the middle
Some with some tab
Some with some space between them
Sometext here
this much

How do I get output that has the same structure as the image?

i.e. as follows:

                                                 Some text here
Some to the left
                     Some in the middle
         Some with some tab
Some with some space between them                       this much
Sar009 asked Mar 24 '14 12:03





2 Answers

Newer versions of tesseract (3.04 and later) have an option called preserve_interword_spaces which should do what you want.

Note that the number of spaces tesseract detects between words may not be the same across similar lines, so words that are left-aligned after a run of spaces (as in your example) may not line up in the output. The preserve_interword_spaces option does not attempt anything fancy: it merely preserves the spaces that extraction found. By default, tesseract collapses runs of spaces into one.

Details on this option are here.
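
As a rough illustration, the option can be passed on the command line with -c, which sets a config variable for a single run. The sketch below wraps that call in Python. The helper names tesseract_command and run_ocr and the file names are hypothetical, and it assumes a tesseract binary (3.04 or later) is on the PATH.

```python
import subprocess


def tesseract_command(image_path, output_base, preserve_spaces=True):
    """Build the argument list for a tesseract CLI call (hypothetical helper)."""
    cmd = ["tesseract", image_path, output_base]
    if preserve_spaces:
        # -c sets one config variable for this run only
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image_path, output_base):
    """Run tesseract and return the recognized text (hypothetical helper)."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    # tesseract writes plain-text output to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling run_ocr("input.png", "out") would then return the text with the detected runs of spaces kept, subject to the caveat above about inconsistent space counts.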

David answered Sep 27 '22 02:09



The only reliable way is to enable hOCR output and parse it. The hOCR output contains the position of each word on the page in pixels, as in the original image.

You can do it by specifying tessedit_create_hocr 1 in Tesseract's config file, or in whatever API you use.

hOCR is a subset of HTML, and what Tesseract generates isn't always valid XML, so you can either use an HTML parser or write your own, but you cannot reliably use an XML parser.
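
A minimal sketch of this approach, using Python's standard html.parser rather than an XML parser. It collects the bounding box of every ocrx_word element and then places the words on a character grid. The names HocrWordParser and render_layout are hypothetical, the char_width and line_height values are assumptions that would need tuning to the font size of a real scan, and the sample markup is hand-written for illustration, not actual Tesseract output.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._depth = 0     # nesting depth of tags inside that word
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested tag such as <strong> or <em>
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 0
                self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text,) + self._bbox)
        self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def render_layout(words, char_width=10, line_height=20):
    """Map pixel positions to text: row from vertical center, column from x0."""
    rows = {}
    for text, x0, y0, x1, y1 in words:
        row = (y0 + y1) // (2 * line_height)
        rows.setdefault(row, []).append((x0, text))
    lines = []
    for row in sorted(rows):
        line = ""
        for x0, text in sorted(rows[row]):
            col = x0 // char_width
            # pad to the target column, keeping at least one space between words
            pad = max(col - len(line), 1) if line else col
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


# Hand-written sample markup (an assumption, not real Tesseract output)
SAMPLE_HOCR = """
<div class='ocr_page'>
<span class='ocr_line' title='bbox 200 2 290 18'>
<span class='ocrx_word' title='bbox 200 2 240 18; x_wconf 90'>Some</span>
<span class='ocrx_word' title='bbox 250 2 290 18; x_wconf 91'><strong>text</strong></span>
</span>
<span class='ocr_line' title='bbox 0 22 40 38'>
<span class='ocrx_word' title='bbox 0 22 40 38; x_wconf 92'>Left</span>
</span>
</div>
"""

if __name__ == "__main__":
    parser = HocrWordParser()
    parser.feed(SAMPLE_HOCR)
    print(render_layout(parser.words))
```

Grouping rows by vertical position instead of by ocr_line elements is a deliberate choice here: it lets words that Tesseract assigned to separate lines or blocks, but which sit at the same height, end up on the same output row. The fixed grid is crude and will misplace words when line spacing or font size varies across the page.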

Karol S answered Sep 26 '22 02:09
