I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me. Currently Tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below.
The output I am getting is as follows:
Someto the left Someto the left Some in the middle Some in the middle Some with some tab Some with some tab Some with some space between them Some with some space between them Sometext here Sometext here this much this much
How do I get output with the same structure as in the image, i.e. as follows:
Some text here Some text here Some to the left Some to the left Some in the middle Some in the middle Some with some tab Some with some tab Some with some space between them this much Some with some space between them this much
The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract. The --psm argument controls the automatic Page Segmentation Mode used by Tesseract.
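As a rough illustration of how these two arguments are passed, here is a minimal Python sketch that only builds the command line. The helper name build_command and the file names are hypothetical, the mode descriptions are paraphrased from `tesseract --help-psm` and `tesseract --help-oem`, and the double-dash spelling assumes Tesseract 4 or later.

```python
from typing import List

# A few page segmentation modes, paraphrased from `tesseract --help-psm`.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    6: "Assume a single uniform block of text",
    11: "Sparse text: find as much text as possible in no particular order",
}

# OCR engine modes, paraphrased from `tesseract --help-oem`.
OEM_MODES = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def build_command(image_path: str, output_base: str, oem: int = 3, psm: int = 6) -> List[str]:
    """Return an argument list for the tesseract CLI (hypothetical helper)."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown OCR engine mode: {oem}")
    if psm not in range(14):  # page segmentation modes are numbered 0 to 13
        raise ValueError(f"unknown page segmentation mode: {psm}")
    return ["tesseract", image_path, output_base, "--oem", str(oem), "--psm", str(psm)]
```

For a layout like the one in the question, trying --psm 6 or --psm 4 instead of the default is probably a reasonable first experiment, since the default mode may split the page into blocks and reorder them.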
While Tesseract is known as one of the most accurate free OCR engines available today, it has numerous limitations that dramatically affect its performance, that is, its ability to correctly recognize characters in a scan or image.
Tesseract reads only image files, not PDF. You can convert the PDF to images (TIFF, PNG) and OCR those, or use a wrapper around Tesseract.
Without libtiff, Tesseract can only read uncompressed and G3 compressed TIFF files.
Newer versions of Tesseract (3.04) have an option called preserve_interword_spaces which should do what you want.
Note that the number of spaces Tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output this way. The preserve_interword_spaces option does not attempt to do anything fancy; it merely preserves the spaces that extraction found. By default Tesseract collapses runs of spaces into one.
Details on this option are here.
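To make this concrete, here is a minimal Python sketch of one way to pass the option on the command line using the CLI's `-c name=value` flag. The helper names tesseract_args and ocr_preserving_spaces are hypothetical, the choice of --psm 6 is an assumption rather than part of the answer, and the flag spelling assumes Tesseract 4 or later.

```python
import shutil
import subprocess
from typing import Dict, List, Optional


def tesseract_args(image_path: str, output_base: str, psm: int = 6,
                   variables: Optional[Dict[str, object]] = None) -> List[str]:
    """Build a tesseract command line with optional `-c name=value` settings."""
    args = ["tesseract", image_path, output_base, "--psm", str(psm)]
    for name, value in (variables or {}).items():
        args += ["-c", f"{name}={value}"]
    return args


def ocr_preserving_spaces(image_path: str, output_base: str) -> None:
    """Run tesseract with preserve_interword_spaces enabled (writes output_base.txt)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = tesseract_args(
        image_path, output_base,
        variables={"preserve_interword_spaces": 1},
    )
    subprocess.run(command, check=True)
```

Calling ocr_preserving_spaces("page.png", "out") would write out.txt with runs of spaces kept as detected, subject to the caveat above that the detected widths can vary from line to line.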
The only reliable way is to enable hOCR output and parse it. It contains the position of each word on the page in pixels, as in the original image.
You can do it by specifying tessedit_create_hocr 1 in Tesseract's config file, or in whatever API you use.
hOCR is a subset of HTML, and what Tesseract generates isn't always valid XML, so you can either use an HTML parser or write your own, but you can't reliably use an XML parser.
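A minimal sketch of the parsing step, using Python's standard html.parser so that non-XML output is tolerated. It assumes words are marked with class ocrx_word and lines with class ocr_line, each word carrying a "bbox x0 y0 x1 y1" entry in its title attribute. The names HocrWordParser and render_layout and the px_per_char scale are hypothetical and would need tuning to the font size of the scan.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (left x in pixels, text) for each word, grouped by hOCR line."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one list of (x0, text) tuples per ocr_line
        self._word_x0 = None   # left edge of the word currently being read
        self._depth = 0        # tag nesting depth inside the current word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._word_x0 is not None:
            self._depth += 1   # nested markup such as <strong> inside a word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._word_x0 = int(match.group(1))
                self._depth = 1
                self._text = []

    def handle_data(self, data):
        if self._word_x0 is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._word_x0 is None:
            return
        self._depth -= 1
        if self._depth == 0:   # the word's own closing tag
            text = "".join(self._text).strip()
            if text:
                if not self.lines:
                    self.lines.append([])
                self.lines[-1].append((self._word_x0, text))
            self._word_x0 = None


def render_layout(hocr: str, px_per_char: float = 10.0) -> str:
    """Place each word at a text column proportional to its pixel x position."""
    parser = HocrWordParser()
    parser.feed(hocr)
    rendered = []
    for words in parser.lines:
        line = ""
        for x0, text in sorted(words):
            column = int(x0 / px_per_char)
            if line:
                # Pad to the target column, keeping at least one space.
                line += " " * max(1, column - len(line))
            else:
                line = " " * column
            line += text
        rendered.append(line)
    return "\n".join(rendered)
```

Because every word is positioned from its own bounding box, left-aligned, centered and tab-separated text should land in roughly the same columns as in the image, which is what the plain text output cannot guarantee.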