I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.
I pass the option -l=deu
to tell Tesseract that the text is in German ("Deutsch").
Still, the result for the German word "für" is poor, even though the word is very common (it means "for" in English).
Tesseract often detects "fiir" or "fur" instead.
What can I do to improve this?
Reproducible example:
docker run --name self.container_name --rm \
--volume $PWD:/pwd \
tesseractshadow/tesseract4re \
tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu
Result:
cat die-fuer-das.png.ocr-result.txt
die fur das
Image die_fuer_das.png:
I found the solution: the option needs to be -l deu (with a space instead of "="),
otherwise the German language is not used. I had accidentally used -l=deu.
Works:
===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das
Wrong language:
===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
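To avoid repeating this mistake in scripts, a small guard can reject the "=" form before Tesseract is called. This is only a minimal sketch: check_lang_flag is a hypothetical helper name and is not part of Tesseract. It only inspects the arguments and does not run OCR itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): fail if the language option
# is written as "-l=LANG" instead of the separate-argument form "-l LANG".
check_lang_flag() {
  for arg in "$@"; do
    case "$arg" in
      -l=*)
        # Strip the "-l=" prefix to suggest the corrected spelling.
        echo "use '-l ${arg#-l=}' instead of '$arg'" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

In a wrapper script this could be used as: check_lang_flag "$@" && tesseract "$@", so that a call with -l=deu stops with a hint instead of silently producing "fur".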