
How to OCR multiple columns in a document using Tesseract

Tags: c++, ocr, tesseract

I am working on a project to OCR the Sinhala language using Tesseract. My goal is to OCR a document whose text is laid out in multiple columns and to get an output file in the correct format. Is there any method to identify the columns in a document using Tesseract?

Sandun Tharaka asked Jul 27 '15 10:07



1 Answer

Setting Tesseract up to work with a multi-column document is surprisingly easy, though I found very little information or discussion specifically about multi-column pages online. The basic idea is to set the page segmentation mode so that Tesseract performs both "Automatic page segmentation" and "Orientation and script detection" (OSD). Automatic page segmentation without OSD is the default; the combined mode is not.

This is as simple as setting psm to 1, which tells Tesseract to use "Automatic page segmentation with OSD." It may not be obvious that OSD has anything to do with recognizing a multi-column document, but in practical terms that is one of the outcomes. Another benefit is that the script detection helps Tesseract avoid trying to OCR non-text blocks like photographs.

For more on page segmentation methods, see: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Here is the command-line syntax for setting the page segmentation mode (Tesseract 4 and later spell the option --psm rather than -psm):

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

For more on the syntax, see: https://github.com/tesseract-ocr/tesseract/wiki
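
To make the syntax above concrete, here is a minimal sketch in Python that assembles such a command with psm set to 1. The helper names (build_tesseract_command, run_ocr), the file names in the usage note, and the language code "sin" for Sinhala are assumptions for illustration only; the sketch also assumes the tesseract binary and the matching language and OSD data files are installed.

```python
import subprocess


def build_tesseract_command(image, output_base, lang=None, psm=1, legacy_flag=False):
    """Assemble a tesseract command line with an explicit page segmentation mode.

    psm=1 is "Automatic page segmentation with OSD", the mode suggested above.
    legacy_flag=True emits the single-dash -psm spelling used by Tesseract 3.x;
    otherwise the double-dash --psm spelling used by 4.x and later is emitted.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # Language code is passed through unchanged, e.g. "sin" (assumed here
        # to be the Sinhala traineddata name installed on the machine).
        cmd += ["-l", lang]
    cmd += ["-psm" if legacy_flag else "--psm", str(psm)]
    return cmd


def run_ocr(image, output_base, **options):
    """Run tesseract; raises CalledProcessError if the binary reports failure."""
    subprocess.run(build_tesseract_command(image, output_base, **options), check=True)
```

Calling run_ocr("page.png", "page_out", lang="sin") would then write the recognized text to page_out.txt. Pass legacy_flag=True if the installed Tesseract is a 3.x release.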

Omar Wasow answered Oct 27 '22 23:10
