Tesseract confuses two numbers

Tags: ocr, tesseract

I'm writing an application to scan numbers from an image.

The numbers are using the OCR-B font and may also contain + and > characters.

This is my source image: https://i.stack.imgur.com/Zb1IG.png

The scans using Tesseract weren't very good, even when I limited the character set to the characters mentioned above. As I couldn't find any OCR-B training files for Tesseract, I decided to train it myself.

I created this training image and made a box file from it. The box file is correct; all characters are matched correctly.

Then I followed all the steps described here to create the other necessary files.

Using this newly trained OCR-B tessdata set, I get pretty good results on the source image, with one little bug: all 1s are mistaken for 8s and vice versa. The command used to process the image was

$ tesseract esr2c.tif ocrb-esr2c -l ocrb

and the output for the source image was

0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20

If you swap all 1s and 8s and compare the result to the source image, the output is correct (except for the last two characters, which I can ignore).
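The swap described above can be reproduced with a short Python sketch. It is only a diagnostic for confirming that the confusion is a clean 1/8 exchange, not a fix for the training data; the function name `swap_ones_and_eights` is made up for this illustration.

```python
# Translation table: every "1" becomes "8" and every "8" becomes "1".
SWAP_1_AND_8 = str.maketrans("18", "81")


def swap_ones_and_eights(text: str) -> str:
    """Return text with the characters 1 and 8 exchanged (hypothetical helper)."""
    return text.translate(SWAP_1_AND_8)


# OCR output quoted in the question.
ocr_output = "0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20"
print(swap_ones_and_eights(ocr_output))
```

If the swapped string matches the source image, the two labels were most likely exchanged consistently somewhere in the training input.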

How could this happen? Did I make a mistake in the training process? How can I fix it?


asked Sep 03 '11 12:09

Danilo Bargen

2 Answers

It's likely that your box file has incorrect values (characters) for 1 and 8 somewhere. You can verify this with the jTessBoxEditor program. If so, correct them, regenerate the language data file, and try again.
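Besides a visual check in a box editor, the labels can be compared against the known text of the training image. The following is a minimal sketch, assuming each box file line has the form `<char> <left> <bottom> <right> <top>` with an optional page number, that there is exactly one box per character, and that boxes appear in the same order as the expected text. The function names and the sample data are hypothetical.

```python
from typing import List, Tuple


def read_box_labels(box_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, label) for every well-formed line of a box file."""
    labels = []
    for line_number, line in enumerate(box_text.splitlines(), start=1):
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        labels.append((line_number, fields[0]))
    return labels


def find_label_mismatches(box_text: str, expected_text: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, label_in_box_file, expected_char) for each wrong label."""
    expected = [ch for ch in expected_text if not ch.isspace()]
    labels = read_box_labels(box_text)
    if len(labels) != len(expected):
        raise ValueError(
            f"box file has {len(labels)} boxes but {len(expected)} characters were expected"
        )
    return [
        (line_number, label, want)
        for (line_number, label), want in zip(labels, expected)
        if label != want
    ]


# Made-up box data in which the labels for 1 and 8 are exchanged.
sample_box = "0 10 20 30 60 0\n8 40 20 60 60 0\n1 70 20 90 60 0\n"
print(find_label_mismatches(sample_box, "018"))
```

An empty result means every label agrees with the expected text; any reported line can then be corrected in the box editor before regenerating the language data.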

answered Sep 30 '22 20:09

nguyenq

I trained Tesseract 2.04 for the OCR-A Extended font after a month of effort. It works very well and shows above 90% accuracy at font size 14.

The training image should be a high-contrast image. In the GIMP image editor, do the following:

1. Open Colors -> Info -> Histogram and read the Std Deviation value.
2. Open Colors -> Threshold and enter that Std Deviation value as the threshold value.
3. Save the image and use it for training.
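The same thresholding recipe can be sketched in plain Python. This is only an illustration of the steps above, assuming the image is already available as rows of 8-bit grayscale values (0 to 255), that the population standard deviation is a fair stand-in for the value GIMP displays, and that pixels at or above the threshold become white. The function name is hypothetical.

```python
from statistics import pstdev
from typing import List


def threshold_by_std_dev(pixels: List[List[int]]) -> List[List[int]]:
    """Binarize a grayscale image, using the std deviation of its values as the threshold."""
    flat = [value for row in pixels for value in row]
    threshold = pstdev(flat)  # population standard deviation of all pixel values
    # Pixels at or above the threshold become white (255), the rest black (0).
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# Tiny made-up 2x2 "image": dark pixels in the left column, bright in the right.
sample = [[10, 200], [30, 220]]
print(threshold_by_std_dev(sample))
```

Whether the standard deviation is a good threshold depends on the image; the sketch simply mirrors the recipe given in this answer.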

Check and edit your box file using qt-box-editor ("qt-box-editor-1.06.exe"). It is very easy to use. Check every box and the character assigned to it; this is very important. Somewhere in your box file there are incorrect characters for 1 and 8.

Then run the remaining training commands.

answered Sep 30 '22 20:09

yogeshjoshicolor
