
How to improve text recognition using Tesseract OCR?

I have implemented Tesseract OCR for text recognition on iOS. I preprocess the input image and pass it to Tesseract, but the recognition results are poor.

Steps:

1. Erode function

2. Dilate function

3. Bitwise_not function

    Mat MCRregion;
    cv::dilate ( MCRregion, MCRregion, 24);
    cv::erode ( MCRregion, MCRregion, 24);
    cv::bitwise_not(MCRregion, MCRregion);

    UIImage * croppedMCRregion = [self UIImageFromCVMat:MCRregion];

    Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
    [tesseract setVariableValue:@"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.>,'`;-:</" forKey:@"tessedit_char_whitelist"];
    [tesseract setImage:[self UIImageFromCVMat:MCRregion]];
    //                [tesseract setImage:image];
    [tesseract recognize];

    NSLog(@"%@", [tesseract recognizedText]);

Input Image:

Image Link

1. How can I improve the text recognition rate with Tesseract?

2. Are any other preprocessing steps applied in Tesseract?

3. Is text dewarping done in Tesseract OCR?

balajichinna asked Dec 02 '22 19:12



2 Answers

Tesseract is highly configurable software, though its configuration parameters are poorly documented (unless you want to dig through the 150K lines of code). A good, comprehensive list is available here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version

Also look at https://code.google.com/p/tesseract-ocr/wiki/ControlParams and https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

You can improve the quality tremendously if you give Tesseract more information about the data you are OCR'ing. For example, if the images are all national IDs or passports that follow standard MRZ formats, you can configure Tesseract to use that information.

For the image you attached (an MRZ), I got the following result:

IDFRADOUEL<<<<<<<<<<<<<<<<<<<<9320 
05O693202O438CHRISTIANE<<N1Z90620<3

using the following config:

# disable dict, freq tables etc which would distract OCR'ing an MRZ
load_system_dawg F
load_freq_dawg F
load_unambig_dawg F
load_punc_dawg F
load_number_dawg F
load_fixed_length_dawgs F
load_bigram_dawg F
wordrec_enable_assoc F

# mrz allows only these chars
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ<

Also make sure your installation is trained for the font in the image to get more accurate results. In your case it appears to be the OCR-B font.
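The MRZ structure can also be used after recognition, not only in the Tesseract config. A minimal sketch, assuming you want to verify the check digits that ICAO Doc 9303 defines for MRZ fields: the function names below are made up for illustration and are not part of Tesseract or of the setup above, and which fields carry a check digit depends on the document type, so splitting a line into fields is left to the caller.

```cpp
#include <string>

// Value of one MRZ character for check-digit purposes (ICAO Doc 9303):
// digits map to 0-9, letters A-Z to 10-35, the filler '<' to 0.
// Returns -1 for any character outside the MRZ alphabet.
int mrzCharValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
    if (c == '<') return 0;
    return -1;
}

// Check digit of one field: weighted sum with repeating weights 7, 3, 1,
// taken modulo 10. Returns -1 if the field contains a non-MRZ character.
int mrzCheckDigit(const std::string& field) {
    static const int weights[3] = {7, 3, 1};
    int sum = 0;
    for (std::size_t i = 0; i < field.size(); ++i) {
        int value = mrzCharValue(field[i]);
        if (value < 0) return -1;
        sum += value * weights[i % 3];
    }
    return sum % 10;
}

// Hypothetical helper: true if the recognized field agrees with the
// recognized check digit character that follows it.
bool mrzFieldLooksValid(const std::string& field, char checkChar) {
    int expected = mrzCheckDigit(field);
    return expected >= 0 && checkChar == static_cast<char>('0' + expected);
}
```

For example, mrzCheckDigit("520727") returns 3, so a field read as 52O727 (letter O instead of zero) followed by the check digit 3 fails the test and can be re-run through OCR or corrected.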

baskin answered Dec 20 '22 21:12



It is not necessary to go through the tedious task of retraining Tesseract. Retraining will give you much better results, but in some cases you can get pretty far with the ENG training set.

You can improve your results by paying attention to the following things:

  1. Use a binary image as input and make sure you have black text on a white background

  2. By default Tesseract will try to make words out of things that have no spacing. Try to segment each character separately and place the characters in a new image with lots of spacing. This matters especially when letters and numbers are mixed, because Tesseract will "correct" a character to match the surrounding ones.

  3. Try to segment different parts of your image and use a whitelist of the characters you know should be in each part. If you're only looking for digits in the first part, use a separate instance of Tesseract with a digits-only whitelist to detect them.

  4. If you use the same object multiple times without resetting it, Tesseract seems to have a memory. That means you can get a different result each time you perform OCR. You can reset Tesseract to counter this, or just create a new object.

  5. Last but not least, use the resultIterator to go through the boxes that Tesseract can give as a result. You can check the size and confidence of each character and filter accordingly.
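Point 1 can be sketched in plain C++ on a raw 8-bit grayscale buffer. This is only an illustration under assumptions: the function names are hypothetical, the fixed threshold of 128 is arbitrary, and the polarity check relies on the heuristic that text covers less area than the background. With OpenCV you would typically do the same on a cv::Mat with cv::threshold and cv::bitwise_not.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Threshold an 8-bit grayscale buffer to pure black (0) or white (255).
// The default cut-off of 128 is an assumption; tune it for your images.
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray,
                                   std::uint8_t threshold = 128) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        out[i] = gray[i] >= threshold ? 255 : 0;
    }
    return out;
}

// Heuristic polarity fix: text usually covers less area than background,
// so if more than half of the pixels are black, assume white text on a
// black background and invert the image in place.
void ensureBlackTextOnWhite(std::vector<std::uint8_t>& binary) {
    std::size_t black = 0;
    for (std::uint8_t pixel : binary) {
        if (pixel == 0) ++black;
    }
    if (black * 2 > binary.size()) {
        for (std::uint8_t& pixel : binary) {
            pixel = static_cast<std::uint8_t>(255 - pixel);
        }
    }
}
```

Checking the polarity instead of always inverting avoids the case where an unconditional bitwise_not turns an already correct image into white text on black.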

diip_thomas answered Dec 20 '22 20:12
