
Image Preprocessing before OCR process

My current project involves transcribing text from PDFs into text files. I first tried feeding the image files directly into an OCR program (Tesseract), and it didn't do well. The original images are scans of old newspapers and have background noise, which I am sure Tesseract has trouble with. So I am trying to preprocess the images before feeding them into Tesseract. Are there any suggestions for an open-source image preprocessing engine that fits this situation? Instructions on how to use it would be even more appreciated!

Sardonic asked Jun 30 '26 16:06


1 Answer

I have never heard of an "image preprocessing engine" for that purpose, but you can take a look at OpenCV (Open Source Computer Vision Library) and implement your own "preprocessing engine". OpenCV is a computer vision library that offers many image processing features.

One interesting thing you might want to test as a preprocessing step is applying a threshold to the image, which turns it into pure black and white and drops faint background noise. Anyway, I've talked about this kind of thing in this thread.

karlphillip answered Jul 03 '26 20:07


