C++ - Disappointing performance with Tesseract

The company I'm working at is considering switching its current OCR engine (Nuance's OmniPage) to an open-source alternative such as Tesseract.

In the hope of getting some performance benchmarks (execution speed and accuracy) to compare the two, I got a very simple program working just to get an idea of how well the Tesseract 3.02 C++ API would perform.

My initial observations (some of them may be off, feel free to correct my interpretations in the comments):

  • The accuracy was good. It compares very well to our current engine.
  • The output formats only contain the recognized text, with no preview of where the text was located in the original image. It is possible to take the hOCR format and convert it to something a bit more visually appealing, but I have failed to find open-source converters for Windows that are suitable for commercial use (I couldn't find the source or an executable for ExactCODE's hocr2pdf). I could write a simple script that converts the detected bbox of every paragraph into an HTML tag with the left, top, width, height and position style attributes set (a minimal sketch follows this list), so this limitation is minor.
  • Leptonica (the image library used by Tesseract) appears not to support reading complex PDF files. Although this adds a minor development overhead to the migration (since it does not work out of the box), it is not much of a problem, since we already have modules in our product to extract the images from PDF files.
  • The execution speed was extremely slow, at least in comparison to Nuance's OmniPage. It took Tesseract a little over 2 minutes on my machine to convert the file at the end of this question. By contrast, OmniPage converted 10 large documents (including the image provided) in less than 3 minutes 30 seconds. (I don't remember exactly how long the provided image alone took, but it was clearly under ~15 seconds.)
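
Regarding the hOCR conversion mentioned above, here is the kind of script I have in mind. It is a minimal, regex-based sketch that extracts the bounding boxes (at the word level rather than the paragraph level, for simplicity) and re-emits each word as an absolutely positioned HTML tag. The attribute layout it matches is an assumption based on Tesseract 3.x hOCR output, so a robust version would use a proper HTML parser instead:

#include <fstream>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

// Minimal hOCR-to-HTML sketch: pull out the word-level bounding boxes
// and re-emit each word as an absolutely positioned <div>. The regex
// assumes the attribute layout of Tesseract 3.x hOCR output (class
// before title, bbox first inside title).
int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: hocr2html <input.hocr> <output.html>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::stringstream buffer;
    buffer << in.rdbuf();
    const std::string hocr = buffer.str();

    // bbox values are "x0 y0 x1 y1" in pixels, origin at the top left.
    const std::regex word_re(
        "<span[^>]*class=['\"]ocrx_word['\"][^>]*"
        "title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>"
        "([^<]*)</span>");

    std::ofstream out(argv[2]);
    out << "<html><body style='position:relative'>\n";
    for (std::sregex_iterator it(hocr.begin(), hocr.end(), word_re), end;
         it != end; ++it) {
        const int x0 = std::stoi((*it)[1].str()), y0 = std::stoi((*it)[2].str());
        const int x1 = std::stoi((*it)[3].str()), y1 = std::stoi((*it)[4].str());
        out << "<div style='position:absolute;"
            << "left:" << x0 << "px;top:" << y0 << "px;"
            << "width:" << (x1 - x0) << "px;height:" << (y1 - y0) << "px'>"
            << (*it)[5] << "</div>\n";
    }
    out << "</body></html>\n";
    return 0;
}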

If it were only about the other factors, the migration could probably be done without too much trouble. This performance limitation, however, is a killer.

Then, I thought to myself: how could Tesseract perform so poorly compared to its commercial equivalents? Google would most certainly strive for performance.

So, I'm almost sure the problem comes from me: I'm either not using the API correctly, not changing settings that I should, or missing something else entirely.

Here is the section of my test program related to Tesseract:

#include "baseapi.h"
#include "allheaders.h"

// ...
// Tesseract initialization
tesseract::TessBaseAPI api;
api.Init("", "eng+deu+fra");
api.SetPageSegMode(tesseract::PageSegMode::PSM_AUTO_OSD);
api.SetVariable("tessedit_create_hocr", "1"); // for the hOCR output

// ...
// OCR
PIX* pixs = pixRead(image_path.c_str());
STRING result;
api.ProcessPages(image_path.c_str(), NULL, 0, &result);

// ... write the result to a file

I tried different page segmentation modes and disabled the creation of the hOCR output, only to be just as disappointed as before. I also tried applying some pre-processing scripts to the image to see if it would help detection, but without success. I tried with only one dictionary for testing purposes, but that did not noticeably change performance either. I had the same performance problems with multi-page TIF files and single-page TIF images, and have not tried other formats yet. (The timing harness I used for these runs is sketched below.)
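
For reference, my timing runs looked roughly like the following sketch. TimeOcr is just a helper name I made up, and sample.tif is a placeholder path; the point is only to compare configurations (here: a single language and PSM_AUTO, i.e. no orientation/script detection) under identical conditions:

#include <chrono>
#include <iostream>
#include "baseapi.h"

// Hypothetical helper that times one full recognition pass for a given
// language pack and page segmentation mode, so configurations can be
// compared against each other.
static double TimeOcr(const char* lang, tesseract::PageSegMode psm,
                      const char* image_path) {
    tesseract::TessBaseAPI api;
    if (api.Init("", lang) != 0)
        return -1.0; // initialization failed (e.g. missing tessdata)
    api.SetPageSegMode(psm);

    STRING result;
    const auto start = std::chrono::steady_clock::now();
    api.ProcessPages(image_path, NULL, 0, &result);
    const auto stop = std::chrono::steady_clock::now();

    api.End();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // Cheapest configuration I tried: one language, automatic layout
    // analysis without orientation/script detection.
    std::cout << TimeOcr("eng", tesseract::PSM_AUTO, "sample.tif")
              << " s\n";
    return 0;
}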

Quick profiling of the application with Very Sleepy showed that most of the execution time was spent in operator new and operator delete calls related to bounding boxes.

I would really like us to migrate to an open-source library rather than a commercial product, so I would appreciate it if anyone could help me achieve better performance with the API. Unless I can get dramatic improvements that bring performance close to our current engine, the migration won't happen.

Thank you very much for your precious time.

Here is an image from my test set:

[Sample OCR image]

asked Jul 11 '13 by Jesse Emond



1 Answer

I don't think there is much you can do about it. You're right: Tesseract is incredibly slow compared to commercial engines like OmniPage or ABBYY, and every comparison test shows that. Those companies do OCR for a living and take speed, accuracy and other factors very seriously.

answered Oct 18 '22 by Tomato