I am using Tesseract on iOS 8 for an OCR-based app, but it incorrectly converts the division symbol "÷" in the image to a plus sign "+".
For example, this image
always converts to the text string "8+4+4". It should be "8+4÷4".
I've tried different trained data language files ("eng+equ", "ita"), adding "÷" to the whitelist, setting the ocr_engine variable to cube, converting the image to grayscale or black and white, and upscaling the image by 2 and 4 times.
Everything I've tried always returns a plus "+" sign instead of a division "÷" symbol.
I tried using only the "equ" trained data file and that DOES return the division symbol correctly - but all other characters are then garbage.
I've been looking into this (Google, Stack Overflow) for several days and cannot figure it out.
How do I get Tesseract to include and recognize the division "÷" symbol?
UPDATE:
The best I have been able to do is to set the AVCaptureSession preset to high:
AVCaptureSession *session = [[AVCaptureSession alloc] init];
session.sessionPreset = AVCaptureSessionPresetHigh;
The captured image above is then 676 × 405 pixels. I use the Tesseract OCR UIImage category to binarize the image (named 'source'):
// Binarize the source image to improve contrast (using the UIImage category provided by TesseractOCR)
UIImage *blackAndWhiteImage = [source blackAndWhite];
[self.tesseract setImage:blackAndWhiteImage];
This will usually convert the division symbol to the text "-1-", but I've seen "-:-" and other numbers and uppercase characters between the minus signs.
I can check for that in the returned text. But then it is impossible to know whether to treat the returned text "8-1-2" as a true subtraction or 'maybe' division.
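The check described above could look roughly like the following Swift sketch. It assumes the artifact is always a single character between two minus signs; the `normalizeDivision` function and `NormalizedExpression` type are hypothetical names, not part of Tesseract. It replaces the artifact with "÷" and flags the result as ambiguous when the middle character is a digit, since "8-1-2" could also be a genuine subtraction.

```swift
import Foundation

/// Result of post-processing one OCR string (hypothetical type for illustration).
struct NormalizedExpression {
    let text: String        // OCR text with suspected division artifacts replaced by "÷"
    let isAmbiguous: Bool   // true if a replaced artifact could also be a real subtraction
}

/// Hypothetical helper: replaces "-X-" (one character between two minus signs) with "÷".
func normalizeDivision(in ocrText: String) -> NormalizedExpression {
    let regex = try! NSRegularExpression(pattern: "-(.)-")
    let nsText = ocrText as NSString
    let fullRange = NSRange(location: 0, length: nsText.length)

    // A digit between the minus signs (e.g. "-1-") may be a true subtraction.
    var ambiguous = false
    for match in regex.matches(in: ocrText, options: [], range: fullRange) {
        let middle = nsText.substring(with: match.range(at: 1))
        if middle.rangeOfCharacter(from: .decimalDigits) != nil {
            ambiguous = true
        }
    }

    let replaced = regex.stringByReplacingMatches(in: ocrText,
                                                  options: [],
                                                  range: fullRange,
                                                  withTemplate: "÷")
    return NormalizedExpression(text: replaced, isAmbiguous: ambiguous)
}

let result = normalizeDivision(in: "8+4-1-4")
print(result.text, result.isAmbiguous)  // 8+4÷4 true
```

This only marks the ambiguity; it does not resolve it. A result flagged as ambiguous would still need another signal (for example, asking the user) to decide between subtraction and division.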
Train the OCR engine with different fonts.
Here is the tool for training the engine. Have a look at this also
Or you can use JTessBoxEditor
Make sure your whitelist includes the "÷" sign.
In Swift, this will do it:
tesseract.setVariableValue("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷", forKey: "tessedit_char_whitelist")
In Objective-C, here is the code:
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷" forKey:@"tessedit_char_whitelist"];
You can customize the character set based on your needs.
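As an illustration of customizing the set, here is a small sketch of a narrower whitelist for simple arithmetic expressions. The specific characters chosen are an assumption about what the images contain, and `arithmeticWhitelist` is a hypothetical name.

```swift
// Hypothetical whitelist for simple arithmetic; adjust to the characters your images contain.
let digits = "0123456789"
let operators = "+-÷=()"
let arithmeticWhitelist = digits + operators
print(arithmeticWhitelist)  // 0123456789+-÷=()

// Pass it the same way as the full whitelist (not run here, needs the TesseractOCR framework):
// tesseract.setVariableValue(arithmeticWhitelist, forKey: "tessedit_char_whitelist")
```

A whitelist only restricts which characters Tesseract may output; it does not teach the engine a symbol that the trained data does not contain.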
It seems that symbol was not included in the existing data. You'd need to train for that symbol, and then use the resultant traineddata in combination with existing ones.
You can use a tool, such as jTessBoxEditor, to assist you in the training process.