I have a problem with running tesseract-ocr engine on linux. I've downloaded RUS language data and put it to tessdata directory (/usr/local/share/tessdata). When I'm trying to run tesseract with command tesseract blob.jpg out -l rus
, it displays an error:
Error opening data file /usr/local/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language eng Tesseract couldn't load any languages! Could not initialize tesseract.
According to compiling guide, I used export TESSDATA_PREFIX='/usr/local/share/'
to point my tessdata directory. Maybe I should edit any config files? Tesseract try to load 'eng' data files instead of 'rus'.
Screenshot: http://i.stack.imgur.com/I0Guc.png
Additional Language Support To install additional languages into Islandora, you will need to know the path to your Tesseract installation's 'tessdata' folder. On Windows, this will tend to be C:\Program Files (x86)\Tesseract OCR\tessdata, if you've used the Tesseract website's own installation case.
The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract. The --psm controls the automatic Page Segmentation Mode used by Tesseract.
You can grab eng.traineddata
Github:
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.
When you grab the file(s), move them to the /usr/local/share/tessdata
folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata
instead.
# If you got the data from Google, unzip it first! gunzip eng.traineddata.gz # Move the data sudo mv -v eng.traineddata /usr/local/share/tessdata/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With