I am trying to extract data from receipts and bills using Tesseract; I am using Tesseract version 3.02.
I am using only English data, yet the output accuracy is only about 60%.
Is there any trained data available that I can simply drop into the tessdata folder?
Luckily, you can train your Tesseract so it can read your font easily.
Tesseract pre-trained models: you can download either the pre-built ones designed to be fast and use less memory, or the ones that need more resources but give better accuracy.
Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract printed text from images. It supports a wide variety of languages.
This is the image nicky provided as a "typical example file":
Looking at it I'd clearly say: "Forget it, nicky! You cannot train Tesseract to recognize 100% of text from this type of image!"
However, you could train yourself to take better photos of this type of receipt with your iPhone 3GS (that's the device which was used for the example pictures). Here are a few tips:
That said, something like the following ImageMagick command will probably increase Tesseract's recognition rate by some degree:
    convert \
        http://i.stack.imgur.com/q3Ad4.jpg \
        -colorspace gray \
        -rotate 90 \
        -crop 260x540+110+75 +repage \
        -scale 166% \
        -normalize \
        -colors 32 \
        out1.png
It produces the following output:
You could even add something like -threshold 30% as the last command-line option to the above command to get this:
(You should play a bit with some variations of the 30% value to tweak the result... I don't have the time for this.)
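If you want to try several threshold values in one go, a small shell sketch along the following lines might help. It is only an illustration: the function names preprocess and sweep_thresholds, the DRY_RUN switch and the out_t<N> file naming are made up for this example, and the rotate/crop/scale values are simply the ones from the command above, so they fit that one example image only.

```shell
#!/bin/sh
# Sketch: rerun the preprocessing command with different -threshold values.
# Function names, DRY_RUN and the out_t<N> naming are assumptions for
# illustration, not part of ImageMagick or Tesseract.

# preprocess INPUT THRESHOLD OUTPUT
# Runs the convert command from above with "-threshold THRESHOLD%" appended.
# If DRY_RUN is set to a non-empty value, the command is printed, not run.
preprocess() {
    ${DRY_RUN:+echo} convert "$1" \
        -colorspace gray \
        -rotate 90 \
        -crop 260x540+110+75 +repage \
        -scale 166% \
        -normalize \
        -colors 32 \
        -threshold "$2%" \
        "$3"
}

# sweep_thresholds INPUT THRESHOLD...
# Writes one preprocessed image and one OCR text file per threshold value.
sweep_thresholds() {
    input=$1
    shift
    for t in "$@"; do
        preprocess "$input" "$t" "out_t${t}.png"
        # tesseract IMAGE OUTPUTBASE writes the text to OUTPUTBASE.txt
        ${DRY_RUN:+echo} tesseract "out_t${t}.png" "out_t${t}"
    done
}
```

For example, after sourcing the file, `sweep_thresholds receipt.jpg 20 30 40` (receipt.jpg being a placeholder for your own photo) should leave out_t20.txt, out_t30.txt and out_t40.txt to compare by eye. Setting DRY_RUN=1 first only prints the commands, which is handy for checking them before anything is run.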