How to OCR multiple columns in a document using Tesseract

Question

I am working on an OCR project for the Sinhala language using Tesseract. My goal is to OCR a document whose text is laid out in multiple columns and get the output file in the correct format. Is there any method to identify columns in a document using Tesseract?

Omar Wasow · Accepted Answer

Setting Tesseract up to work with a multi-column document is surprisingly easy, though I found very little information or discussion online that deals specifically with multi-column pages. The basic idea is to set the page segmentation mode to do both "Automatic page segmentation" (the default) AND "Orientation and script detection" (OSD, not the default setting).

This is as simple as setting psm to 1, which selects "Automatic page segmentation with OSD." While it may not be obvious that OSD means "recognize a multi-column document", in practical terms that is one of the outcomes. Another benefit is that the script detection helps Tesseract avoid trying to OCR non-text blocks like photographs.

For more on page segmentation methods, see: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Here is the command-line syntax for setting the page segmentation mode (Tesseract 4 and later spell the option --psm rather than -psm):

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

For more on the syntax, see: https://github.com/tesseract-ocr/tesseract/wiki
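
As a rough illustration of the syntax above, here is a minimal Python sketch that assembles the command for a Sinhala page with psm set to 1. The file names page.png and page_out are hypothetical, the helper names build_tesseract_command and run_ocr are made up for this example, and it assumes the Sinhala language data (sin) is installed alongside the tesseract binary.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang="sin", psm=1, legacy_flag=False):
    """Return the argument list for a Tesseract run with the given page segmentation mode."""
    # Tesseract 3.x spells the option -psm; 4.x and later use --psm.
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image, outputbase, "-l", lang, psm_flag, str(psm)]


def run_ocr(image, outputbase, **options):
    """Run Tesseract; it writes the recognized text to <outputbase>.txt."""
    # Assumes the tesseract binary is on PATH; raises CalledProcessError on failure.
    subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(" ".join(build_tesseract_command("page.png", "page_out")))
```

Calling run_ocr("page.png", "page_out") would then execute that command. Pass legacy_flag=True if the installed Tesseract is a 3.x release that only accepts -psm.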

Tags:

c++

ocr

tesseract

Sandun Tharaka
