
Tesseract OCR Read Horizontally rather than Vertically C#

Tags: c#, ocr, tesseract

We have a C# .NET app that uses Tesseract to perform Optical Character Recognition (OCR) on .tiff files. Here's an example: [image: example TIFF file that Tesseract reads]

We then write the data to a text file. However, Tesseract reads the data vertically. In my example image, it treats the TIFF as two columns of data, and the output from Tesseract looks like this:

TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes

What we want is for Tesseract to read the TIFF file horizontally, so that each label is followed by its value and the output looks like this:

TYPE:12345
DATE:2017-04-06
Address:100 Main St.
City:Some City
State:Some State
Owner:John Doe
Owner Type:Primary
Acreage:10.25
Mortgage:Yes

We've tried the various Page Segmentation options for Tesseract, but they all produce the same result.

Has anyone run into this same issue? Anybody have any ideas?

MikeTWebb asked Jan 21 '26 01:01



1 Answer

I found a solution. Tesseract ships with a set of config files, and several of them contain the setting tessedit_pageseg_mode. In each of those files it was set to 1, which means automatic page segmentation with OSD (orientation and script detection).
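To locate those settings quickly, a small scan of the config folders can help. The following is a minimal sketch in Python, not part of the original fix. It assumes the config files live in `configs` and `tessconfigs` subfolders of the tessdata directory and that each setting sits on its own line as `name value`. The function name `find_pageseg_overrides` is hypothetical.

```python
from pathlib import Path

# Setting named in the answer above.
SETTING = "tessedit_pageseg_mode"


def find_pageseg_overrides(tessdata_dir):
    """Return (file name, line number, value) for each config line that sets SETTING.

    Assumption: config files sit in the 'configs' and 'tessconfigs'
    subfolders of tessdata_dir, one 'name value' pair per line.
    """
    hits = []
    for sub in ("configs", "tessconfigs"):
        folder = Path(tessdata_dir) / sub
        if not folder.is_dir():
            continue  # folder absent in this install, skip it
        for path in sorted(folder.iterdir()):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                parts = line.split()
                if len(parts) >= 2 and parts[0] == SETTING:
                    hits.append((path.name, lineno, parts[1]))
    return hits
```

Calling `find_pageseg_overrides(r"C:\path\to\tessdata")` (the path is a placeholder) returns an empty list when no config file sets the mode, so it can also be used to confirm that the cleanup is complete.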

Bottom line: these config file settings were overriding our command line argument. Once I removed the tessedit_pageseg_mode parameter from the config files, our command line argument -psm 6 worked and produced the output data in the desired format.

psm = Page Segmentation Mode. 6 = Assume a single uniform block of text.

-psm 4 also worked. 4 = Assume a single column of text of variable sizes.
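For reference, here is a minimal sketch of how the command line could be assembled, written in Python for brevity; the same argument list applies when launching the process from C#. The function names, the file names, and the `legacy_flag` parameter are hypothetical. One thing to watch: Tesseract 3.x spells the option `-psm`, as used above, while Tesseract 4.0 and later spell it `--psm`.

```python
import subprocess


def build_tesseract_args(image_path, output_base, psm=6, legacy_flag=True):
    """Build the argument list: tesseract <image> <outputbase> -psm <mode>.

    legacy_flag=True uses the 3.x spelling '-psm' from the answer above;
    set it to False for the '--psm' spelling of Tesseract 4.0 and later.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]


def run_ocr(image_path, output_base, psm=6, legacy_flag=True):
    """Run Tesseract; it writes the recognized text to <output_base>.txt."""
    args = build_tesseract_args(image_path, output_base, psm, legacy_flag)
    subprocess.run(args, check=True)  # raises CalledProcessError on failure
```

For example, `build_tesseract_args("scan.tif", "scan_out")` yields the arguments for `tesseract scan.tif scan_out -psm 6`. Note that no config file name is passed, so nothing on the command line can reintroduce a tessedit_pageseg_mode override.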

MikeTWebb answered Jan 22 '26 17:01



