
Tesseract OCR won't recognize division symbol "÷"

I am using Tesseract in iOS 8 for an OCR-based app, but it incorrectly converts the division "÷" symbol in the image to a plus "+" sign.

For example, this image

[Image: simple arithmetic expression]

always converts to the text string "8+4+4". It should be "8+4÷4".

I've tried different trained data language files ("eng+equ", "ita"), adding "÷" to the whitelist, setting the ocr_engine variable to cube, converting the image to grayscale or black and white, and upscaling the image by 2 and 4 times.

Everything I've tried always returns a plus "+" sign instead of a division "÷" symbol.

I tried using only the "equ" trained data file and that DOES return the division symbol correctly - but all other characters are then garbage.

I've been looking into this (Google, Stack Overflow) for several days and cannot figure it out.

How do I get Tesseract to include and recognize the division "÷" symbol?

UPDATE:

The best I have been able to do is to set the AVCaptureSession preset to high:

AVCaptureSession *session = [[AVCaptureSession alloc] init];
session.sessionPreset = AVCaptureSessionPresetHigh;

The dimensions of the captured image above are then 676 × 405 pixels. I then binarize the image (named 'source') using the UIImage category provided by Tesseract OCR:

// Binarize the source image to improve contrast (using the UIImage category provided by TesseractOCR)
UIImage *blackAndWhiteImage = [source blackAndWhite];
[self.tesseract setImage:blackAndWhiteImage];

This will usually convert the division symbol to the text "-1-", but I've seen "-:-" and other numbers and uppercase characters between the minus signs.

I can check for that in the returned text. But then it is impossible to know whether to treat the returned text "8-1-2" as a true subtraction or 'maybe' division.
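
A minimal sketch of that check in Swift, assuming the misread always has the form of one digit, uppercase letter, or colon between two minus signs. The function name normalizeDivision and the pattern are hypothetical, not part of Tesseract. It does not resolve the ambiguity described above; it only rewrites the pattern and flags the result so the caller knows the division is a guess.

```swift
import Foundation

// Hypothetical helper (not a Tesseract API): rewrites "-X-" to "÷",
// where X is a single digit, uppercase letter, or colon.
// The flag is true when a rewrite happened, i.e. the "÷" is only a guess
// and the original text may have been a genuine subtraction.
func normalizeDivision(_ text: String) -> (text: String, ambiguous: Bool) {
    let pattern = "-[0-9A-Z:]-"
    let rewritten = text.replacingOccurrences(of: pattern,
                                              with: "÷",
                                              options: .regularExpression)
    return (rewritten, rewritten != text)
}

let result = normalizeDivision("8+4-1-4")
print(result.text)       // prints "8+4÷4"
print(result.ambiguous)  // prints "true"
```

A real subtraction such as "8-1-2" is rewritten to "8÷2" as well, which is exactly the problem: the flag marks the result as uncertain but cannot tell the two cases apart.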

Craig Pickering asked Nov 16 '14 12:11


3 Answers

Train the OCR engine with different fonts.

Here is the tool for training the engine. Have a look at this as well.

Or you can use jTessBoxEditor.

Neenu answered Oct 07 '22 12:10


Make sure your whitelist includes the "÷" sign.

In Swift, this will do it:

tesseract.setVariableValue("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷", forKey: "tessedit_char_whitelist")

In Objective-C, here is the code:

[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷" forKey:@"tessedit_char_whitelist"];

You can customize the character set based on your needs.

Mikrasya answered Oct 07 '22 12:10


It seems that symbol was not included in the existing data. You'd need to train for that symbol, and then use the resultant traineddata in combination with existing ones.

You can use a tool, such as jTessBoxEditor, to assist you in the training process.

nguyenq answered Oct 07 '22 12:10