We have a C# .NET app that uses Tesseract to do Optical Character Recognition (OCR) on .tiff files. Here's an example:
We then output the data to a text file. However, Tesseract is reading the data vertically. In my example image, it reads the tiff as two columns of data, and the output from Tesseract looks like this:
TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes
What we want is for Tesseract to read the tiff file horizontally, row by row, so that the output looks like this:
TYPE:12345
DATE:2017-04-06
Address:100 Main St.
City:Some City
State:Some State
Owner:John Doe
Owner Type:Primary
Acreage:10.25
Mortgage:Yes
We've tried the various Page Segmentation options for Tesseract, but they all produce the same result.
Has anyone run into this same issue? Anybody have any ideas?
I found a solution. Tesseract has a set of config files. Inside several of these config files is the setting tessedit_pageseg_mode. This setting was set to 1 in all the config files. 1=Automatic page segmentation with OSD. OSD=Orientation and script detection.
Bottom line, these config file settings were overriding our command line argument. Once I removed the tessedit_pageseg_mode parameter from the config files, our command line argument of -psm 6 worked and produced the output data in the desired format.
psm=Page Segmentation Mode. 6=Assume a single uniform block of text.
-psm 4 also worked.
psm=Page Segmentation Mode. 4=Assume a single column of text of variable sizes.
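If it helps anyone track down the same problem, here is a rough Python sketch for listing which config files set tessedit_pageseg_mode. It assumes the config files are plain text with one "name value" pair per line, and that you pass in the directory where your Tesseract config files live. The function name find_psm_overrides and the script name are my own placeholders, not part of Tesseract.

```python
import sys
from pathlib import Path

# The setting that was overriding the -psm command line argument.
SETTING = "tessedit_pageseg_mode"


def find_psm_overrides(config_dir):
    """Return (path, line number, line text) for every line that sets SETTING.

    Assumption: config files are plain text, one "name value" pair per line.
    """
    hits = []
    for path in sorted(Path(config_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            fields = line.split()
            if fields and fields[0] == SETTING:
                hits.append((path, number, line.strip()))
    return hits


if __name__ == "__main__":
    if len(sys.argv) != 2:
        # Hypothetical script name; the directory is whatever your install uses.
        print("usage: find_psm.py <config-dir>")
    else:
        for path, number, line in find_psm_overrides(sys.argv[1]):
            print(f"{path}:{number}: {line}")
```

This only reports the lines; it does not edit anything. I removed the setting by hand once I knew which files had it.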