
How to speed up Tesseract OCR

I'm trying to OCR a lot of documents (in the range of 300k+ a day). At the moment I'm using the Tesseract wrapper for .NET. The quality is fine, but the speed is not good enough. With 20 tasks running in parallel, each scanning half a page from the same PDF, I get an average of 2.546 seconds per scan. The code I'm using:

using (var engine = new TesseractEngine(Tessdata, "eng", EngineMode.TesseractOnly))
{
    Page page;
    page = engine.Process(image, srcRect);
    var text = page.GetText();
    return Task.FromResult(text);
}

That average is measured after halving the image resolution and converting the image to grayscale. Any ideas for speeding up the process? I don't need the text segmented, just the text in one line. Should I maybe use something like Matlab for C#?

TestzWCh asked Jun 02 '17 07:06



1 Answer

Currently, you create a new TesseractEngine object for each page you scan. Creating the engine is costly because it reads the 'tessdata' files.

You say you have 20 parallel tasks running. Since an engine cannot process multiple pages at once, you will need to create one engine per task and reuse it for all the pages that task processes. To process the next page with an existing engine, simply call using (var page = Engine.Process(pix)).

Reusing the engine should significantly improve performance because you'll only have to create 20 engines instead of 300k.
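As a rough illustration of the idea, here is a minimal sketch of one long-lived engine per worker. The OcrBatch class, the RunAsync method and its parameters are hypothetical names made up for this example. It also assumes each input is an image file on disk and OCRs the whole image rather than a srcRect region as in the question, so adapt the Process call to your case.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Tesseract;

static class OcrBatch
{
    // Hypothetical helper: OCRs every image in imagePaths using
    // workerCount engines that each live for the whole batch.
    public static async Task<IDictionary<string, string>> RunAsync(
        IEnumerable<string> imagePaths, string tessdataPath, int workerCount)
    {
        var queue = new ConcurrentQueue<string>(imagePaths);
        var results = new ConcurrentDictionary<string, string>();

        var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            // One engine per worker: the tessdata files are read here,
            // once per worker, instead of once per page.
            using (var engine = new TesseractEngine(tessdataPath, "eng", EngineMode.TesseractOnly))
            {
                // Each worker pulls the next path until the queue is empty.
                while (queue.TryDequeue(out var path))
                {
                    using (var pix = Pix.LoadFromFile(path))
                    using (var page = engine.Process(pix))
                    {
                        // The page is disposed at the end of this block,
                        // before the engine processes the next image.
                        results[path] = page.GetText();
                    }
                }
            }
        })).ToArray();

        await Task.WhenAll(workers);
        return results;
    }
}
```

With workerCount set to 20 this creates 20 engines for the whole batch, however many pages are in the queue. The shared queue also balances the load, since a worker that finishes a page early simply takes the next one.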

GWigWam answered Oct 24 '22 05:10