TL;DR It appears that tesseract cannot recognize images consisting of a single digit. Is there a workaround/reason for this?
I am using (the digits only version of) tesseract to automate inputting invoices to the system. However, I noticed that tesseract seems to be unable to recognize single digit numbers such as the following:
The raw scan after crop is:
[image not preserved]
After I did some image enhancing:
[image not preserved]
It works fine if it has at least two digits:
[image not preserved]
I've tested a couple of other figures:
Not working: [images not preserved]
Working: [images not preserved]
If it helps, for my purpose all inputs to tesseract have been cropped and rotated like above. I am using pyocr as a bridge between my project and tesseract.
Here's how you can configure pyocr to recognize individual digits:
from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]

im = Image.open('digit.png')
builder = pyocr.builders.DigitBuilder()

# Set the page segmentation mode to 10 (treat the image as a single character):
builder.tesseract_layout = 10  # read by the libtesseract backend
builder.tesseract_flags = ['-psm', '10']  # read by the tesseract command-line backend; Tesseract 4+ expects '--psm'

result = tool.image_to_string(im, lang="eng", builder=builder)
Individual digits are handled the same way as other characters, so changing the page segmentation mode should help to pick up the digits correctly.
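One detail that can trip this up is the spelling of the page segmentation flag: Tesseract 3.x accepts `-psm`, while Tesseract 4 and later expect `--psm`. The helper below is a minimal sketch of picking the right spelling from a version tuple. The function name `single_char_flags` and the constant `SINGLE_CHAR_PSM` are hypothetical names introduced here for illustration; they are not part of pyocr or tesseract.

```python
# Hypothetical helper (not part of pyocr): build the command-line flags
# that ask tesseract to treat the whole image as a single character.

SINGLE_CHAR_PSM = 10  # page segmentation mode 10 = single character


def single_char_flags(version):
    """Return the PSM flags for a tesseract version tuple such as (4, 1, 1).

    Assumption: major versions up to 3 use '-psm', later ones use '--psm'.
    """
    major = version[0]
    flag = "-psm" if major <= 3 else "--psm"
    return [flag, str(SINGLE_CHAR_PSM)]


if __name__ == "__main__":
    print(single_char_flags((3, 4, 0)))
    print(single_char_flags((4, 1, 1)))
```

With the command-line backend, the returned list could be assigned to `builder.tesseract_flags` in place of the hard-coded `['-psm', '10']`, assuming you obtain the version tuple from your installed tesseract.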
See also: Tesseract does not recognize single characters