Detect white characters on black background using Tesseract

Tags:

tesseract

I'm completely new to Tesseract OCR. This problem might be simple but I can't seem to find the answer using Google.

Basically, I have an image with two parts: the top part has white text on a black background, and the bottom part has black text on a white background.

I ran Tesseract on the image. It correctly recognized all characters in the bottom part, but none in the top part. I am sure the characters in the top part are very clear and should be easy for Tesseract to recognize. The only difference is the black background.

Is there a way to have Tesseract recognize text on both black and white backgrounds at the same time?

Chaoran asked Aug 17 '16 17:08



1 Answer

A paper by T. Kasar, J. Kumar, and A. G. Ramakrishnan describes one solution to the problem: "Font and Background Color Independent Text Binarization". The paper can be found here. There is an implementation of the algorithm by Jason Funk. His implementation can be found here. I have had some success with the algorithm. I think this type of solution is what you are looking for.

You might also find it helpful to review this recently asked question on background removal (OpenCV for OCR: How to compute thresholding levels for gray image OCR) and its answer. You may be able to separate regions of interest by background color and then hand each region to Tesseract for processing. Alternatively, after binarization you could invert the 8x8 pixel regions (described in the answer referenced above) in the black-background portion of the image (or vice versa) to create a uniform background.
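As a rough illustration of the tile-inversion idea, here is a minimal sketch in plain Python. It assumes the image is already a grayscale grid of 0..255 values (a list of rows), and it treats any tile whose mean brightness falls below a threshold as "dark background". The function name `normalize_polarity`, the `tile` and `threshold` parameters, and the synthetic test image are all made up for this example; this is not the algorithm from the paper, only a simple stand-in for the "invert the dark regions" step.

```python
def normalize_polarity(gray, tile=8, threshold=128):
    """Return a copy of `gray` in which every mostly-dark tile is inverted.

    `gray` is a list of rows of 0..255 grayscale values (hypothetical input
    format chosen for this sketch). The input is not modified.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [row[:] for row in gray]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            total = sum(gray[y][x] for y in rows for x in cols)
            count = len(rows) * len(cols)
            # A mean below the threshold is taken to mean "dark background".
            if total < threshold * count:
                for y in rows:
                    for x in cols:
                        out[y][x] = 255 - gray[y][x]
    return out


# Tiny synthetic image: top 8 rows are white-on-black,
# bottom 8 rows are black-on-white.
image = [[0] * 16 for _ in range(8)] + [[255] * 16 for _ in range(8)]
image[3][5] = 255   # one "white text" pixel on the dark half
image[11][5] = 0    # one "black text" pixel on the light half

result = normalize_polarity(image)
# Both halves now have a light background with dark "text" pixels.
```

In practice you would load the real image with an imaging library, convert it to grayscale, apply something like the above, and then pass the uniform-background result to Tesseract. A tile that is almost entirely covered by a thick text stroke could be misclassified, so the tile size and threshold would likely need tuning for a given image.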

Finally, you may find useful information by searching for solutions to the number plate (license plate) recognition problem. Many plates have background images or lighting artifacts that can interfere with recognition. The more general problem is background removal.

John Morris answered Oct 15 '22 05:10