The company I work at is considering switching from its current OCR engine (Nuance's OmniPage) to an open-source alternative such as Tesseract.
In the hope of getting some performance benchmarks (execution speed and accuracy) to compare the two, I put together a very simple program just to get an idea of how well the Tesseract 3.02 C++ API would perform.
My initial observations (some of them may be off, feel free to correct my interpretations in the comments):
- Output format: the hOCR output does not directly give a style element that sets the left, right, width, height and position attributes of the HTML tag, but the same information is available in the bounding boxes, so this limitation is minor.

If it were only about the other factors, the migration could probably be done without too much trouble. This performance limitation, however, is a killer.
Then, I thought to myself: how could Tesseract perform so poorly compared to its commercial equivalents? Google would most certainly strive for performance.
So I'm almost sure the problem comes from me: I'm either not using the API in the right way, not changing settings that I should, or missing something else entirely.
Here is the section of my test program related to Tesseract:
#include "baseapi.h"
#include "allheaders.h"
// ...
// Tesseract initialization
tesseract::TessBaseAPI api;
api.Init("", "eng+deu+fra");
api.SetPageSegMode(tesseract::PageSegMode::PSM_AUTO_OSD);
api.SetVariable("tessedit_create_hocr", "1"); // for the hOCR output
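// Note: "eng+deu+fra" loads three traineddata files and gives the classifier
// three character sets to consider, and PSM_AUTO_OSD adds an orientation and
// script detection pass on top of the normal layout analysis, so both of
// these settings add to the per-page work.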
// ...
// OCR
PIX* pixs = pixRead(image_path.c_str()); // note: not actually used below
STRING result;
api.ProcessPages(image_path.c_str(), NULL, 0, &result); // reads the file itself
pixDestroy(&pixs); // free the unused Pix
// ... write the result to a file
I tried different page segmentation modes and also ran without hOCR output enabled, only to be just as disappointed as before. I also tried applying some pre-processing scripts to the images to see whether it would help detection, but without success. I tried with only one dictionary for testing purposes, but that didn't noticeably change the performance either. I saw the same performance problems with multi-page and single-page TIFF files, and have not tried other formats yet.
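For completeness, here is roughly what the leanest configuration I tried looks like: a single language, plain automatic segmentation without orientation/script detection, and no hOCR output. Treat it as a sketch; image_path is the same placeholder as above.
tesseract::TessBaseAPI api;
api.Init("", "eng");                      // single traineddata file only
api.SetPageSegMode(tesseract::PSM_AUTO);  // automatic layout analysis, no OSD pass
STRING result;
api.ProcessPages(image_path.c_str(), NULL, 0, &result); // plain UTF-8 text output
// ... write the result to a file, as before
api.End();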
Quick profiling of the application with VerySleepy showed that most of the execution time was spent in calls to new and delete related to bounding boxes.
I would really like us to migrate to an open-source library rather than a commercial product, so I would appreciate it if anyone could help me get better performance out of the API. Unless I can get dramatic improvements that bring the results close to the current engine, the migration won't happen.
Thank you very much for your precious time.
Here is an image from my test set:
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text.
Combinations of basic preprocessing steps such as rescaling, binarisation and noise removal are said to boost the accuracy of Tesseract 4.0 from 70.2% to 92.9%.
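As a rough illustration of that kind of preprocessing, a small Leptonica sketch (Leptonica already ships with Tesseract) could look like the following; the file names, the 2x scale factor and the Otsu tile size are illustrative guesses that would need tuning for real documents.
#include "allheaders.h"
// Grayscale -> upscale -> Otsu binarisation; error handling omitted.
PIX* original = pixRead("scan.tif");                      // placeholder input
PIX* gray = pixConvertTo8(original, 0);                   // force 8 bpp grayscale
PIX* scaled = pixScale(gray, 2.0f, 2.0f);                 // upscale small text
PIX* binary = NULL;
pixOtsuAdaptiveThreshold(scaled, 200, 200, 0, 0, 0.0f, NULL, &binary);
pixWrite("scan_preprocessed.tif", binary, IFF_TIFF_G4);   // feed this file to Tesseract
pixDestroy(&original);
pixDestroy(&gray);
pixDestroy(&scaled);
pixDestroy(&binary);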
Tesseract was open-sourced by HP in 2005 and has been developed by Google since 2006. EasyOCR, on the other hand, is described as ready-to-use OCR with support for 40+ languages, including Chinese, Japanese, Korean and Thai.
I don't think there is much you can do about it. It is true that Tesseract is incredibly slow compared to commercial engines like OmniPage or ABBYY; every comparison test shows that. Those companies do OCR for a living and take speed, accuracy and the other factors very seriously.