The company I work at is considering switching from its current OCR engine (Nuance's OmniPage) to an open-source alternative such as Tesseract.
In the hope of getting some performance benchmarks (execution speed and accuracy) to compare the two, I put together a very simple program just to get an idea of how well the Tesseract 3.02 C++ API would perform.
My initial observations (some of them may be off, feel free to correct my interpretations in the comments):
- Output format: the hOCR output does not directly give a style element that sets the left, right, width, height and position attributes of the HTML tag, but the same information is available in the bounding boxes, so this limitation is minor.

If it were only about the other factors, the migration could probably be done without too much trouble. This performance limitation, however, is a killer.
Then, I thought to myself: how could Tesseract perform so poorly compared to its commercial equivalents? Google would most certainly strive for performance.
So I'm almost sure the problem comes from me: I'm either not using the API in the right way, not changing settings that I should, or missing something else entirely.
Here is the section of my test program related to Tesseract:
#include "baseapi.h"
#include "allheaders.h"
// ...
// Tesseract initialization
tesseract::TessBaseAPI api;
api.Init("", "eng+deu+fra");
api.SetPageSegMode(tesseract::PageSegMode::PSM_AUTO_OSD);
api.SetVariable("tessedit_create_hocr", "1"); // for the hOCR output
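// Note: "eng+deu+fra" loads three traineddata files and gives the classifier
// three character sets to consider, and PSM_AUTO_OSD adds an orientation and
// script detection pass on top of the normal layout analysis, so both of
// these settings add to the per-page work.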
// ...
// OCR
PIX* pixs = pixRead(image_path.c_str()); // note: not actually used below
STRING result;
api.ProcessPages(image_path.c_str(), NULL, 0, &result); // reads the file itself
pixDestroy(&pixs); // free the unused Pix
// ... write the result to a file
I tried different page segmentation modes and also ran without hOCR output enabled, only to be just as disappointed as before. I also tried applying some pre-processing scripts to the images to see whether it would help detection, but without success. I tried with only one dictionary for testing purposes, but that didn't noticeably change the performance either. I saw the same performance problems with multi-page and single-page TIFF files, and have not tried other formats yet.
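For completeness, here is roughly what the leanest configuration I tried looks like: a single language, plain automatic segmentation without orientation/script detection, and no hOCR output. Treat it as a sketch; image_path is the same placeholder as above.
tesseract::TessBaseAPI api;
api.Init("", "eng");                      // single traineddata file only
api.SetPageSegMode(tesseract::PSM_AUTO);  // automatic layout analysis, no OSD pass
STRING result;
api.ProcessPages(image_path.c_str(), NULL, 0, &result); // plain UTF-8 text output
// ... write the result to a file, as before
api.End();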
Quick profiling of the application with VerySleepy showed that most of the execution time was spent in calls to new and delete related to bounding boxes.
I would really like us to migrate to an open-source library rather than a commercial product, so I would appreciate it if anyone could help me get better performance out of the API. Unless I can get dramatic improvements that bring the results close to the current engine, the migration won't happen.
Thank you very much for your precious time.
Here is an image from my test set:
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text.
Combinations of basic preprocessing steps such as rescaling, binarisation and noise removal are said to boost the accuracy of Tesseract 4.0 from 70.2% to 92.9%.
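As a rough illustration of that kind of preprocessing, a small Leptonica sketch (Leptonica already ships with Tesseract) could look like the following; the file names, the 2x scale factor and the Otsu tile size are illustrative guesses that would need tuning for real documents.
#include "allheaders.h"
// Grayscale -> upscale -> Otsu binarisation; error handling omitted.
PIX* original = pixRead("scan.tif");                      // placeholder input
PIX* gray = pixConvertTo8(original, 0);                   // force 8 bpp grayscale
PIX* scaled = pixScale(gray, 2.0f, 2.0f);                 // upscale small text
PIX* binary = NULL;
pixOtsuAdaptiveThreshold(scaled, 200, 200, 0, 0, 0.0f, NULL, &binary);
pixWrite("scan_preprocessed.tif", binary, IFF_TIFF_G4);   // feed this file to Tesseract
pixDestroy(&original);
pixDestroy(&gray);
pixDestroy(&scaled);
pixDestroy(&binary);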
Tesseract was open-sourced by HP in 2005 and has been developed by Google since 2006. EasyOCR, on the other hand, is described as ready-to-use OCR with support for 40+ languages, including Chinese, Japanese, Korean and Thai.
I don't think there is much you can do about it. It is true that Tesseract is incredibly slow compared to commercial engines like OmniPage or ABBYY; every comparison test shows that. Those companies do OCR for a living and take speed, accuracy and the other factors very seriously.