How to improve OCR accuracy when using tesseract? [duplicate]

Question

I've been using tesseract to convert documents into text. The quality of the documents varies widely, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that highly pixellated text, for example text generated by fax machines, is especially difficult for tesseract to process; presumably the jagged edges of the characters confound the shape-recognition algorithms.

What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixellated images and have seen some small improvement, but I'm hoping there is a more specific technique that would yield better results. I have in mind something like a filter tuned to black-and-white images that smooths out irregular edges, followed by a filter that increases the contrast to make the characters more distinct.

Any general tips for someone who is a novice at image processing?

user898678 · Accepted Answer

Fix the DPI (if needed): 300 DPI is the minimum.
Fix the text size: e.g. 12 pt should be OK for tesseract 3.x (a.k.a. the legacy engine). New: the best accuracy with tesseract >= 4.x (LSTM engine) is with capital letters 30-33 pixels high.
Try to fix the text lines (deskew and dewarp the text).
Try to fix the illumination of the image (e.g. no dark parts of the image).
Binarize and de-noise the image.
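
Two of the steps above, resizing and binarizing, can be sketched in plain Python. This is only an illustration, not part of the original answer: the function names are hypothetical, the pixels are assumed to be a flat list of 8-bit grayscale values, and Otsu's method is just one common way to pick a binarization threshold. In practice an image library would do this work much faster.

```python
def scale_for_dpi(current_dpi, target_dpi=300):
    """Upscale factor needed to reach target_dpi (never downscales)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def scale_for_cap_height(measured_cap_px, target_cap_px=32):
    """Resize factor that brings capital letters to about target_cap_px tall.

    The default of 32 is an assumed value inside the 30-33 pixel range
    suggested for the LSTM engine.
    """
    if measured_cap_px <= 0:
        raise ValueError("measured_cap_px must be positive")
    return target_cap_px / measured_cap_px


def otsu_threshold(pixels):
    """Pick the gray level (0..255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]
```

For example, a 150 DPI scan would need `scale_for_dpi(150)`, i.e. a 2x upscale, before OCR, and capitals measured at 16 pixels would likewise need a 2x resize to land in the suggested range.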

There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can try TEXTCLEANER from Fred's ImageMagick Scripts.

If you are not a fan of the command line, you can try the open-source scantailor.sourceforge.net or the commercial bookrestorer.

John · Answer

I am by no means an OCR expert, but this week I needed to extract text from a jpg.

I started with a color (RGB) 445x747 pixel jpg. I immediately tried tesseract on it, and the program recognized almost nothing. I then went into GIMP and did the following.

image > mode > grayscale
image > scale image > 1191x2000 pixels
filters > enhance > unsharp mask with radius = 6.8, amount = 2.69, threshold = 0

I then saved as a new jpg at 100% quality.

Tesseract was then able to extract all the text into a .txt file.

GIMP is your friend.
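
The GIMP steps above can be approximated in a script. The sketch below assumes the third-party Pillow library is installed; the function names and file paths are hypothetical. Pillow's `UnsharpMask` takes a percentage rather than GIMP's amount, and its radius may not behave exactly like GIMP's, so treat the parameter mapping as a starting point to tune rather than an exact equivalent.

```python
def scaled_size(width, height, target_height):
    """Return (width, height) resized to target_height, keeping aspect ratio."""
    scale = target_height / height
    return round(width * scale), target_height


def preprocess_for_ocr(src_path, dst_path, target_height=2000):
    """Grayscale, upscale, and sharpen an image, then save it for tesseract."""
    # Imported here so scaled_size() stays usable without Pillow installed.
    from PIL import Image, ImageFilter

    img = Image.open(src_path).convert("L")  # image > mode > grayscale
    new_size = scaled_size(img.width, img.height, target_height)
    img = img.resize(new_size, Image.LANCZOS)  # image > scale image
    # GIMP amount 2.69 is mapped to percent=269 (an assumed correspondence).
    img = img.filter(ImageFilter.UnsharpMask(radius=6.8, percent=269, threshold=0))
    img.save(dst_path, quality=100)  # dst_path is assumed to be a .jpg
```

With the numbers from this answer, `scaled_size(445, 747, 2000)` gives `(1191, 2000)`, matching the size entered in GIMP.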

Tags: image-processing, ocr, tesseract

user364902
