Tesseract does not recognize german "für"

Tags: ocr, tesseract

I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.

I pass the option -l=deu to tell Tesseract that the text is in German ("Deutsch").

Still, the result for the German word "für" is poor. The word is very common (it means "for" in English).

Tesseract often detects "fiir" or "fur" instead.

What can I do to improve this?

Reproducible example:

docker run --name self.container_name --rm \
    --volume  $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu

Result:

cat die-fuer-das.png.ocr-result.txt 
die fur das

[Image: die-fuer-das.png]

guettli asked May 24 '18 10:05

1 Answer

I found the solution. The option must be written as -l deu (with a space); otherwise the German language data is not used. I had accidentally written -l=deu.

Works:

===> tesseract  die-fuer-das.png out  -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das

Wrong language:

===> tesseract  die-fuer-das.png out  -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
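
To avoid repeating this mistake, a small guard can reject the -l=LANG spelling before Tesseract runs. This is only a minimal sketch, assuming a POSIX shell; check_lang_flag and ocr are hypothetical helper names I made up for illustration, not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): reject the "-l=LANG" spelling,
# which did not select the language in the runs shown above.
check_lang_flag() {
    for arg in "$@"; do
        case "$arg" in
            -l=*)
                # Strip the "-l=" prefix to suggest the working form.
                echo "use '-l ${arg#-l=}' instead of '$arg'" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Hypothetical wrapper: validate the arguments, then pass them to tesseract unchanged.
ocr() {
    check_lang_flag "$@" || return 1
    tesseract "$@"
}
```

Usage would look like `ocr die-fuer-das.png out -l deu && cat out.txt`. Running `tesseract --list-langs` is also a quick way to confirm that the deu language data is actually installed in the image.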
guettli answered Nov 04 '22 22:11
