I am trying to extract data from receipts and bills using Tesseract. I am using Tesseract version 3.02 with only the English data, but the output accuracy is still only about 60%. Is there any trained data available that I can simply drop into the tessdata folder?


Tesseract Trained data

1 Answer

This is the image nicky provided as a "typical example file":

[Image: typical example file, https://i.stack.imgur.com/q3Ad4.jpg]

Looking at it I'd clearly say: "Forget it, nicky! You cannot train Tesseract to recognize 100% of text from this type of image!"

However, you could train yourself to take better photos of such receipts with your iPhone 3GS (that's the device which was used for the example pictures). Here are a few tips:

- Don't use a dark background. Use white instead.
- Don't let the receipt paper crumple. Straighten it out.
- Don't place the receipt loosely on an uneven surface. Fix it to a flat surface:
  - Either place it on a white sheet of paper and put a glass plate over it.
  - Or glue it flat onto a white sheet of paper, without any bent-up edges or corners.
- Don't use a low resolution like just 640x480 pixels (as the example picture has). Use a higher one, such as 1280x960 pixels.
- Don't use standard exposure. Set the camera to use extremely high contrast. You want the letters to be black and the white background to be really white (you don't need the grays in the picture).
- Try to make each character of a 10-12 pt font about 24-30 pixels high (that is, make the image about 300 dpi at 100% zoom).
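The last tip can be turned into a rough scale factor for ImageMagick's `-scale` option. The following is only a sketch: `scale_percent` is a hypothetical helper name, and the default target of 27 pixels is an assumed midpoint of the 24-30 pixel range, not a value Tesseract prescribes. You would measure the height of a typical character in your photo yourself.

```shell
# Hypothetical helper: percentage to pass to ImageMagick's -scale so that
# characters measured at $1 pixels tall end up about $2 pixels tall.
scale_percent() {
    measured_px=$1
    target_px=${2:-27}   # assumption: midpoint of the 24-30 pixel range
    # Integer arithmetic, rounded to the nearest whole percent.
    echo $(( (target_px * 100 + measured_px / 2) / measured_px ))
}

scale_percent 16      # prints 169
scale_percent 12 30   # prints 250
```

A result of 169 would be used as `-scale 169%`. Upscaling cannot add detail that the camera did not capture, so photographing at a higher resolution is still the better fix.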

That said, something like the following ImageMagick command will probably increase Tesseract's recognition rate to some degree:

    convert \
       http://i.stack.imgur.com/q3Ad4.jpg \
      -colorspace gray \
      -rotate 90 \
      -crop 260x540+110+75 +repage \
      -scale 166% \
      -normalize \
      -colors 32 \
       out1.png

It produces the following output:

[Image: ImageMagick optimization for OCR, https://i.stack.imgur.com/5oudq.png]

You could even add something like -threshold 30% as the last command-line option of the command above to get this:

[Image: result after adding -threshold 30%, https://i.stack.imgur.com/wunEz.png]

(You should experiment with values around 30% to tune the result. I don't have the time for this.)
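One way to run that experiment is to loop over a few threshold values and OCR each result. This is a minimal sketch, assuming ImageMagick's `convert` and the `tesseract` command-line tool are on your PATH and that `out1.png` is the file written by the command above. The function name `try_thresholds` and the `out_t` file prefix are made up for illustration.

```shell
# Sketch: apply several -threshold values to an already preprocessed image
# and run Tesseract on each result.
# Assumes `convert` (ImageMagick) and `tesseract` are installed.
try_thresholds() {
    input=$1
    shift
    for pct in "$@"; do
        out="out_t${pct}"
        # Binarize: pixels brighter than pct% become white, the rest black.
        convert "$input" -threshold "${pct}%" "${out}.png"
        # Tesseract writes the recognized text to ${out}.txt.
        tesseract "${out}.png" "$out" -l eng
    done
}

# Example call (commented out so that loading this file has no side effects):
# try_thresholds out1.png 20 25 30 35 40
```

Comparing the resulting `.txt` files side by side should show which threshold suits a given receipt. The best value will likely differ from photo to photo.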

173

answered Oct 18 '22 21:10

Kurt Pfeifle
