
OpenCV Gaussian blur breaks Tesseract?

Tags:

c++

tesseract

The problem: about a week ago, hoping to improve Tesseract's accuracy further, I added a Gaussian blur / Otsu binarization combo, which produces beautiful binary images like the one attached. I do this in OpenCV, so the image I pass to Tesseract is already binary. When Tesseract runs its own pre-processing on that image (even the one posted below), the image becomes corrupted and no meaningful output is produced. See the second image below for an idea of what Tesseract is doing to it.

The source of the problem is the Gaussian blur. If I remove it, the thresholded image that Tesseract outputs is not garbled, but it is also not as clean and readable as the binary image I attached. Can I stop Tesseract from pre-processing the images I pass it? Why does a Gaussian blur completely ruin Tesseract's result? I feel that if Tesseract's input were as clear as the image I attached, accuracy would improve.

Both images are of the same column. First is input image, second is the result of Tesseract's image pre-processing.

INPUT TO TESSERACT EXAMPLE:

image

TESSERACT CORRUPTION (obtained from GetThresholdedImage()):

two

Trés DuBiel asked Jan 16 '16 23:01


1 Answer

I would suggest saving the image data from Tesseract (tess.GetThresholdedImage()) to disk right after calling tess.SetImage(), so you can be sure you passed the correct image for OCR.
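A complementary check on the OpenCV side, as a rough sketch only: dump the exact buffer you are about to hand to tess.SetImage(), using the same width, height and bytes-per-line values, and compare it with what GetThresholdedImage() gives back. The helper name dump_gray_pgm is hypothetical (it is not part of OpenCV or Tesseract), and it assumes an 8-bit, single-channel image. If this dump already looks skewed or garbled, the buffer description is the likely culprit rather than Tesseract's thresholding.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper (not part of OpenCV or Tesseract): writes an 8-bit,
// single-channel buffer as a binary PGM (P5) file, honoring the row stride.
// Assumption: 1 byte per pixel, rows stored top to bottom.
bool dump_gray_pgm(const std::string& path, const unsigned char* data,
                   int width, int height, int bytes_per_line) {
    if (data == nullptr || width <= 0 || height <= 0 || bytes_per_line < width) {
        return false;  // arguments are inconsistent for 1 byte per pixel
    }
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;  // could not open the output file
    }
    // PGM header: magic number, dimensions, maximum gray value.
    out << "P5\n" << width << " " << height << "\n255\n";
    for (int y = 0; y < height; ++y) {
        // Write only the visible pixels of each row; skip any padding bytes.
        const unsigned char* row =
            data + static_cast<std::size_t>(y) * bytes_per_line;
        out.write(reinterpret_cast<const char*>(row), width);
    }
    return static_cast<bool>(out);
}
```

With a cv::Mat called binary (a placeholder name for your blurred and thresholded image), the call would be something like dump_gray_pgm("before_tess.pgm", binary.data, binary.cols, binary.rows, static_cast<int>(binary.step)), using the same values you then pass as width, height and bytes_per_line to tess.SetImage() with bytes_per_pixel set to 1.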

user898678 answered Sep 24 '22 07:09