I'm trying to OCR a lot of documents (in the range of 300k+ a day). At the moment I'm using the Tesseract wrapper for .NET. The quality is fine, but the speed is not good enough. With 20 tasks running in parallel, each scanning half a page from the same PDF, I get an average of 2.546 seconds per scan. The code I'm using:
using (var engine = new TesseractEngine(Tessdata, "eng", EngineMode.TesseractOnly))
{
    // Dispose the Page when done so the engine is free for the next page.
    using (var page = engine.Process(image, srcRect))
    {
        var text = page.GetText();
        return Task.FromResult(text);
    }
}
This average is measured after halving the image resolution and converting it to grayscale. Any ideas to speed up the process? I don't need the text segmented, just the text as a single line. Should I maybe use something like Matlab for C#?
I'm trying to create a real-time OCR in Python using mss and pytesseract.
Tesseract performs well on high-resolution images. Certain image-processing operations such as dilation, erosion, and Otsu binarization can help improve pytesseract's performance. EasyOCR is a lightweight model that performs well on receipt or PDF conversion.
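As an illustration of the Otsu binarization mentioned above, here is a minimal sketch in plain C#. It assumes the image is already available as an 8-bit grayscale byte array; the class and method names (`OtsuBinarizer`, `ComputeThreshold`, `Binarize`) are made up for this example and are not part of Tesseract or any OCR library.

```csharp
using System;

public static class OtsuBinarizer
{
    // Returns the gray level t that maximizes the between-class variance
    // when pixels <= t are treated as "dark" and pixels > t as "bright".
    public static int ComputeThreshold(byte[] gray)
    {
        var histogram = new long[256];
        foreach (byte value in gray) histogram[value]++;

        double sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += i * (double)histogram[i];

        long countLow = 0;
        double sumLow = 0, bestVariance = -1;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++)
        {
            countLow += histogram[t];
            sumLow += t * (double)histogram[t];
            long countHigh = gray.Length - countLow;
            if (countLow == 0) continue;   // no dark pixels yet
            if (countHigh == 0) break;     // no bright pixels left

            double meanLow = sumLow / countLow;
            double meanHigh = (sumAll - sumLow) / countHigh;
            double diff = meanLow - meanHigh;
            double variance = (double)countLow * countHigh * diff * diff;
            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    // Maps every pixel to 0 (dark) or 255 (bright) using the Otsu threshold.
    public static byte[] Binarize(byte[] gray)
    {
        int threshold = ComputeThreshold(gray);
        var output = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            output[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        return output;
    }
}
```

Whether binarizing before OCR actually helps depends on the input, so it is worth measuring on a sample of your own pages.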
In terms of speed, Tesseract outperforms EasyOCR on CPU, while EasyOCR performs very well on GPU.
Multi-threaded Tesseract is fast on a single page because it uses more than one CPU core for some time-consuming parts of the OCR process. For mass OCR, this does not help. If many pages have to be processed, it is better to use single-threaded Tesseract and run several Tesseract processes in parallel.
Is there any way to make it faster? Any ideas on how to make Tesseract read faster? You can already run 4 parallel instances of Tesseract on your quad-core machine; it will then read 4 images in about the same time as one. Introducing multithreading would not reduce the time needed to OCR many images.
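A rough sketch of the "several single-threaded processes" approach in C# could look like the following. It assumes the `tesseract` command-line tool is on the PATH and was built with OpenMP, in which case the standard OpenMP variable `OMP_THREAD_LIMIT=1` keeps each process on one thread. The class and method names (`TesseractProcesses`, `BuildStartInfo`, `RunAll`) are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class TesseractProcesses
{
    // Describes one "tesseract <image> stdout -l eng" invocation,
    // limited to a single OpenMP thread.
    public static ProcessStartInfo BuildStartInfo(string imagePath, string exe = "tesseract")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exe,
            Arguments = "\"" + imagePath + "\" stdout -l eng",
            RedirectStandardOutput = true,
            UseShellExecute = false,   // required for redirection and env vars
            CreateNoWindow = true
        };
        startInfo.EnvironmentVariables["OMP_THREAD_LIMIT"] = "1";
        return startInfo;
    }

    // Runs at most maxProcesses Tesseract processes at a time
    // (for example, one per CPU core) and collects their text output.
    public static string[] RunAll(IReadOnlyList<string> imagePaths, int maxProcesses)
    {
        var texts = new string[imagePaths.Count];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxProcesses };
        Parallel.For(0, imagePaths.Count, options, i =>
        {
            using (var process = Process.Start(BuildStartInfo(imagePaths[i])))
            {
                texts[i] = process.StandardOutput.ReadToEnd();
                process.WaitForExit();
            }
        });
        return texts;
    }
}
```

Something like `TesseractProcesses.RunAll(paths, Environment.ProcessorCount)` would then keep every core busy with one page each. Note that starting a new process per image pays the engine start-up cost every time, so this trades start-up overhead for simplicity.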
Stefan, what about using OpenMP for training?
Currently, you create a new TesseractEngine object for each page you scan. Creating the engine is costly because it reads the 'tessdata' files.
You say you have 20 parallel tasks running. Since the engine cannot process multiple pages at once, you will need to create one engine per task and reuse it for all the pages that task processes. You can simply call using (var page = Engine.Process(pix)) to process the next page with an existing engine.
Reusing the engine should significantly improve performance because you'll only have to create 20 engines instead of 300k.
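One way to structure "one engine per task, reused for every page" is sketched below. To keep the sketch self-contained it is generic over the engine type, so it compiles without the Tesseract package; `OcrPool`, `ProcessAll`, `createEngine` and `recognize` are hypothetical names chosen for this example, and results are keyed by item, so it assumes the items (for example, file paths) are unique.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class OcrPool
{
    // Starts workerCount long-lived workers. Each worker creates ONE engine,
    // pulls items from a shared queue until it is empty, then disposes the engine.
    public static IReadOnlyDictionary<string, string> ProcessAll<TEngine>(
        IEnumerable<string> items,
        int workerCount,
        Func<TEngine> createEngine,
        Func<TEngine, string, string> recognize)
        where TEngine : IDisposable
    {
        var queue = new ConcurrentQueue<string>(items);
        var results = new ConcurrentDictionary<string, string>();

        var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            using (var engine = createEngine())          // created once per worker
            {
                while (queue.TryDequeue(out var item))   // reused for every item
                {
                    results[item] = recognize(engine, item);
                }
            }
        })).ToArray();

        Task.WaitAll(workers);
        return results;
    }
}

// Possible wiring to the wrapper from the question (assumes its
// TesseractEngine, Pix.LoadFromFile and Page.GetText members):
//
// var texts = OcrPool.ProcessAll(
//     paths, 20,
//     () => new TesseractEngine(Tessdata, "eng", EngineMode.TesseractOnly),
//     (engine, path) =>
//     {
//         using (var pix = Pix.LoadFromFile(path))
//         using (var page = engine.Process(pix))
//             return page.GetText();
//     });
```

With 20 workers this creates exactly 20 engines no matter how many pages are queued, which is the saving the answer describes.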