
Tesseract OCR - recognize checkboxes as word

Tags:

ocr

tesseract

For a customer, I want to teach Tesseract to recognize checkboxes as a word. This worked fine as long as Tesseract only had to recognize an empty checkbox.

This command, in combination with this tutorial, worked like a charm: Tesseract was able to find empty checkboxes and interpret them as "[_]":

tesseract -psm 10 deu2.unchecked1.exp0.JPG deu2.unchecked1.exp0.box nobatch box.train

Here is the command I use to successfully analyze a document:

tesseract test.png test -l deu1+deu2

Then I tried to train a checked checkbox, but got this error:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 1/[X] ((60,30),(314,293)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:       1
   Boxes failed resegmentation:       1
   Found 0 good blobs.
Generated training data for 0 words
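
One thing that can be ruled out quickly is a malformed box. The following is a minimal Python sketch, assuming the Tesseract 3.x box-file layout `<symbol> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left corner of the image. The box line itself is reconstructed from the coordinates in the error message, and the image sizes passed in are purely hypothetical, since the real size of the training image is not given here. Passing this check does not explain the "Couldn't find a matching blob" failure; it only excludes bad coordinates as the cause.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the symbol is whatever precedes the five numbers.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def box_problems(box: Box, width: int, height: int) -> List[str]:
    """Return human-readable problems with a box for an image of the given size."""
    problems = []
    if box.left >= box.right:
        problems.append("left edge is not left of right edge")
    if box.bottom >= box.top:
        problems.append("bottom edge is not below top edge")
    if box.left < 0 or box.bottom < 0:
        problems.append("box starts outside the image")
    if box.right > width:
        problems.append(f"right edge {box.right} exceeds image width {width}")
    if box.top > height:
        problems.append(f"top edge {box.top} exceeds image height {height}")
    return problems


# Box line reconstructed from the error message above (an assumption).
box = parse_box_line("[X] 60 30 314 293 0")

# 400x320 is a hypothetical image size, not the real one.
print(box_problems(box, width=400, height=320))
```

If the list comes back empty for the real image dimensions, the box is geometrically valid and the failure lies in how Tesseract segments the glyph.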

Does anyone have an idea how to teach Tesseract to recognize checked checkboxes as well?

Thank you in advance!

Christoph Bätz asked Oct 29 '22 23:10


1 Answer

After many more attempts I figured out that it is of course possible to teach Tesseract different kinds of letters. But as far as I know today, there is no way to teach Tesseract a symbol that does not conform to certain "visual rules" of a letter. For example, a letter is always one connected stroke of ink, or at most a combination of ink and "something outside it" (for example: i, ä, ö, ü). The problem here is that no letter resembles a checked checkbox (one object inside another object). This leads to confusion and crashes in Tesseract.
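
The "one object inside another object" observation can be made concrete with a toy example. The sketch below is only an illustration of that visual rule on hand-drawn bitmaps, not a description of how Tesseract actually segments glyphs. It finds the connected ink blobs and reports whether one blob's bounding box lies strictly inside another's. The bitmaps and the function names `components` and `has_nested_blob` are made up for this example.

```python
from collections import deque


def components(bitmap):
    """Return bounding boxes (top, left, bottom, right) of 8-connected '#' blobs."""
    height, width = len(bitmap), len(bitmap[0])
    seen = set()
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != "#" or (y, x) in seen:
                continue
            # Breadth-first flood fill over one blob, tracking its extent.
            queue = deque([(y, x)])
            seen.add((y, x))
            top = bottom = y
            left = right = x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def has_nested_blob(bitmap):
    """True if some blob's bounding box lies strictly inside another blob's."""
    boxes = components(bitmap)
    for inner in boxes:
        for outer in boxes:
            if (outer[0] < inner[0] and outer[1] < inner[1]
                    and inner[2] < outer[2] and inner[3] < outer[3]):
                return True
    return False


CHECKED = [
    "#######",
    "#.....#",
    "#.#.#.#",
    "#..#..#",
    "#.#.#.#",
    "#.....#",
    "#######",
]
EMPTY = [
    "#######",
    "#.....#",
    "#.....#",
    "#.....#",
    "#.....#",
    "#.....#",
    "#######",
]
LETTER_I = [
    "..#..",
    ".....",
    "..#..",
    "..#..",
    "..#..",
]

for name, bitmap in [("checked", CHECKED), ("empty", EMPTY), ("i", LETTER_I)]:
    print(name, len(components(bitmap)), has_nested_blob(bitmap))
```

In this toy setting the letter "i" and the checked box both consist of two blobs, but only the checked box has one blob enclosed by the other, which is the property that no ordinary letter shares.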

Christoph Bätz answered Nov 12 '22 17:11