
How to improve PDF quality before OCR using C#

Tags:

c#

pdf

ocr

readable

I'm creating a service that monitors a folder for scanned files. Once a file arrives, the service picks it up and converts it to a readable (searchable) PDF. During this process the service also searches for a barcode. After that, the text is extracted, and the file and its text are stored in the database of our software. The storage location is based on the barcode.

For the OCR we are using the Atalasoft SDK (http://www.atalasoft.com/). The barcode recognizer is also part of this SDK.

However, the converted text still contains some mistakes. (I ran some tests with other OCR programs, but Atalasoft came out well.) I'm looking for software, ideally an SDK, that lets me improve the quality of the PDF before it goes through OCR.

I tested Kofax VRS Elite (http://www.kofax.com/vrs-virtualrescan/). I'm looking for something similar that can be integrated into the service through an SDK.

Has anyone done this before, or run into similar problems? Thanks in advance!

Anthony Claeys asked Nov 04 '22 18:11



1 Answer

You could try a different path altogether:
See if you can configure the scanner(s) to scan directly to PDF and do the OCR on the fly. The Lexmark scanners can do this. It produces PDFs with selectable and searchable text, which can then be extracted with a PDF reading library.

Alternatively you may want to have a look at http://www.abbyy.com/ and see if you get better results.

If these are not good options, you may want to break down your problem in a systematic way:
1. Is the image quality of the scanned images the problem? If so, then this will have to be fixed first. Your OCR solution may be affected by resolution, contrast, and colour.
2. Is it the OCR software? Take a highly legible document and see if the OCR software makes mistakes. If so, then you know you have to find better OCR software.
3. If your document quality is decent and your OCR software has a high success rate on a legible document, then you may want to look at the exceptions that do not work and tackle these on a case-by-case basis.
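
To make step 2 measurable, you can type out the text of a clean test page by hand and compare it with what the OCR engine returns. The following is a minimal sketch, not tied to any OCR SDK: it assumes you already have both texts as strings, and the class and method names (OcrAccuracy, EditDistance, CharacterErrorRate) are made up for this example. It computes the Levenshtein edit distance and divides it by the length of the expected text to get a character error rate.

```csharp
using System;

public static class OcrAccuracy
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn 'expected' into 'actual'.
    public static int EditDistance(string expected, string actual)
    {
        var previous = new int[actual.Length + 1];
        var current = new int[actual.Length + 1];

        // Turning an empty prefix into the first j characters takes j insertions.
        for (int j = 0; j <= actual.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= expected.Length; i++)
        {
            current[0] = i; // i deletions to reach an empty string
            for (int j = 1; j <= actual.Length; j++)
            {
                int cost = expected[i - 1] == actual[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }

            // Reuse the two row buffers for the next iteration.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[actual.Length];
    }

    // Fraction of characters the OCR got wrong, relative to the expected text.
    public static double CharacterErrorRate(string expected, string actual)
    {
        if (expected.Length == 0)
            return actual.Length == 0 ? 0.0 : 1.0;

        return (double)EditDistance(expected, actual) / expected.Length;
    }
}
```

Running the same reference page through the pipeline before and after a change (scanner settings, preprocessing, a different OCR engine) and comparing the two error rates tells you whether the change actually helped.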

If smears and background images on the documents are the cause of the problem, you may want to look into ways of avoiding them, or of cleaning them up with image processing software that exposes an API.
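
As an illustration of the kind of cleanup meant here, below is a rough sketch of global binarization with Otsu's method, a common preprocessing step before OCR. It is not what Kofax VRS or Atalasoft do internally. It assumes the page has already been decoded into an array of 8-bit grayscale pixel values (0 = black, 255 = white); how you get those pixels out of your PDF or imaging SDK is not shown, and the names ScanCleanup, OtsuThreshold, and Binarize are hypothetical.

```csharp
using System;

public static class ScanCleanup
{
    // Otsu's method: choose the threshold that maximizes the variance
    // between the "dark" class (<= threshold) and the "light" class (> threshold).
    public static int OtsuThreshold(byte[] gray)
    {
        var histogram = new long[256];
        foreach (byte pixel in gray)
            histogram[pixel]++;

        long total = gray.Length;
        double sumAll = 0;
        for (int v = 0; v < 256; v++)
            sumAll += v * (double)histogram[v];

        double sumDark = 0;
        long countDark = 0;
        double bestVariance = -1;
        int bestThreshold = 0;

        for (int t = 0; t < 256; t++)
        {
            countDark += histogram[t];
            sumDark += t * (double)histogram[t];

            if (countDark == 0) continue;      // no dark pixels yet
            long countLight = total - countDark;
            if (countLight == 0) break;        // no light pixels left

            double meanDark = sumDark / countDark;
            double meanLight = (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;

            // Between-class variance, up to a constant factor.
            double variance = (double)countDark * countLight * diff * diff;
            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }

        return bestThreshold;
    }

    // Map every pixel to pure black or pure white around the threshold.
    public static byte[] Binarize(byte[] gray, int threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            result[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        return result;
    }
}
```

A single global threshold like this tends to work when the background is fairly uniform. Pages with uneven shading, strong smears, or printed background images usually need a local (adaptive) threshold or a dedicated cleanup library instead.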

Jack answered Nov 14 '22 18:11
