How to preserve document structure in tesseract


I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me. Currently Tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below.

[Image: input]

and the output I am getting is as follows:

Someto the left
Some in the middle
Some with some tab
Some with some space between them
Sometext here
this much

How do I get output that has the same structure as the image?

i.e. as follows:

                                                 Some text here
Some to the left
                     Some in the middle
         Some with some tab
Some with some space between them                       this much


asked Mar 24 '14 12:03

Sar009

2 Answers

Newer versions of Tesseract (3.04 and later) have an option called preserve_interword_spaces, which should do what you want.

Note that the number of spaces Tesseract detects between words is not always consistent across similar lines. Words that are left-aligned with a run of spaces preceding them (as in your example) may therefore not line up in the output. The preserve_interword_spaces option does not attempt anything fancy; it merely preserves the spaces that the extraction found. By default, Tesseract collapses runs of spaces into one.

Details on this option are here.
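As a rough sketch of how the option could be switched on from a script: the snippet below calls the tesseract command-line tool with `-c preserve_interword_spaces=1` and writes the result to stdout. The function names `build_command` and `ocr_keep_spaces` and the file name `input.png` are made up for illustration, and it assumes a tesseract binary (3.04 or later) is on your PATH.

```python
import subprocess


def build_command(image_path, output_base="stdout"):
    """Build a tesseract CLI call that keeps runs of spaces between words.

    "stdout" as the output base tells tesseract to print the text
    instead of writing <output_base>.txt.
    """
    return [
        "tesseract", image_path, output_base,
        "-c", "preserve_interword_spaces=1",
    ]


def ocr_keep_spaces(image_path):
    """Run tesseract and return the recognized text.

    Assumes the tesseract executable is installed and on PATH.
    """
    result = subprocess.run(
        build_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `print(ocr_keep_spaces("input.png"))` should then show the text with the detected spacing kept, subject to the caveat above that the spacing may not be consistent from line to line.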


answered Sep 27 '22 02:09

David

The only reliable way is to enable hOCR output and parse it. It contains the position of each word on the page in pixels, as in the original image.

You can do this by setting tessedit_create_hocr to 1 in Tesseract's config file, or through whatever API you use.

hOCR is a subset of HTML, and what Tesseract generates isn't always valid XML, so you can either use an HTML parser or write your own, but you can't reliably use an XML parser.
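To illustrate the parsing step, here is a minimal sketch that uses Python's built-in (lenient) HTML parser rather than an XML parser. It assumes the hOCR marks each word as a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`, which is how Tesseract's hOCR usually looks. The names `HocrWordParser` and `render_layout`, and the `char_width` and `line_tolerance` values, are assumptions for illustration that you would tune for your image.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((*self._bbox, text))
            self._bbox = None


def render_layout(words, char_width=10, line_tolerance=10):
    """Place words on a character grid based on their pixel positions.

    char_width: assumed pixels per output character (a guess, tune it).
    line_tolerance: words whose top edges differ by at most this many
    pixels from the first word of a line are put on that line.
    """
    lines = []  # list of (top_y, [words])
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(word[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word[1], [word]))

    rows = []
    for _, line_words in lines:
        row = ""
        for x0, _, _, _, text in sorted(line_words):
            col = x0 // char_width
            # Pad up to the target column; keep at least one space between words.
            pad = max(col - len(row), 1 if row else 0)
            row += " " * pad + text
        rows.append(row)
    return "\n".join(rows)
```

Feeding the hOCR string to `HocrWordParser().feed(...)` and passing the collected `words` to `render_layout` gives a plain-text approximation of the page. The alignment is only as good as the assumed character width, so a monospaced rendering of a proportional-font scan will not match exactly.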

answered Sep 26 '22 02:09

Karol S
