What OCR options exist beyond Tesseract? [closed]

I've used Tesseract a bit and its results leave much to be desired. I'm currently trying to recognize text in very small images (35x15, without a border; I have tried adding one with ImageMagick, with no OCR improvement). They contain 2 to 5 characters in a fairly consistent font, but the characters vary enough that simply using an image size checksum or similar is not going to work.

What options exist for OCR besides sticking with Tesseract or doing a complete custom training of it? Also, it would be VERY helpful if this were compatible with Heroku style hosting (at least where I can compile the bins and shove them over).

ylluminate asked Mar 13 '12 19:03



1 Answer

I have successfully used GOCR in the past for small image OCR. I would say accuracy was around 85%, after getting the grayscale options set properly, on fairly regular fonts. It fails miserably when the fonts get complicated and has trouble with multiline layouts.

Also have a look at Ocropus, which is maintained by Google. It's related to Tesseract, but from what I understand, its OCR engine is different. With just the default models included, it achieves near 99% accuracy on high-quality images, handles layout pretty well, and provides HTML output with formatting and line information. However, in my experience, its accuracy is very low when the image quality is poor. That being said, training is relatively simple, so you might want to give it a try.

Both of them are easily callable from the command line. GOCR usage is very straightforward; just type gocr -h and you should have all the information you need. Ocropus is a bit more tricky; here's a usage example, in Ruby:

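    require 'fileutils'

    tmp = 'directory'   # working directory for intermediate Ocropus output
    file = 'file.png'   # image to recognize

    # Run the Ocropus pipeline stages in order.
    `ocropus book2pages #{tmp}/out #{file}`
    `ocropus pages2lines #{tmp}/out`
    `ocropus lines2fsts #{tmp}/out`
    `ocropus buildhtml #{tmp}/out > #{tmp}/output.html`

    text = File.read("#{tmp}/output.html")
    FileUtils.rm_rf(tmp)   # clean up the working directory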
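
Calling GOCR can follow the same shell-out pattern. The following is a minimal sketch, assuming gocr is on the PATH and that its -i (input file), -l (grey-level threshold) and -C (character filter) options behave as described by gocr -h. The helper names gocr_command and gocr_text are hypothetical, and the threshold and character set shown are placeholders to tune for your images. Depending on the build, gocr may need the PNG converted to a PNM format first.

```ruby
require 'open3'

# Build the argument list for a gocr call.
# Assumed flags: -i input file, -l grey-level threshold, -C character filter.
def gocr_command(image, gray_level: nil, charset: nil)
  cmd = ['gocr', '-i', image]
  cmd += ['-l', gray_level.to_s] if gray_level
  cmd += ['-C', charset] if charset
  cmd
end

# Run gocr and return the recognized text, or nil if the call fails
# (for example, when the gocr binary is not installed).
def gocr_text(image, **opts)
  out, status = Open3.capture2(*gocr_command(image, **opts))
  status.success? ? out.strip : nil
rescue Errno::ENOENT
  nil
end

# Inspect the command that would be run for a small image.
p gocr_command('file.png', gray_level: 160, charset: '0-9A-Z')
```

Passing the arguments as an array rather than interpolating them into a backtick string avoids shell quoting problems with unusual file names.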
user2398029 answered Sep 22 '22 11:09