
Tesseract OCR loading a language - Japanese

Tags: tesseract

I just installed Tesseract OCR. Running tesseract --list-langs shows only two languages, eng and osd. How do I load another language, in my case Japanese?

Freddy asked Aug 16 '17 15:08



3 Answers

I found that you can download the trained data for the language (jpn.traineddata for Japanese) from https://github.com/tesseract-ocr/tessdata and place it in the same directory as the existing trained data, i.e., next to eng.traineddata. Then pass the language with the -l LANG flag, and Tesseract should be able to read the language you specified. For Japanese, for example: tesseract -l jpn sample-jpn.png output-jpn
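
The download step above can be sketched in Python. This is only a minimal sketch under stated assumptions: the helper names (traineddata_url, traineddata_path, install_language) are made up for illustration, the raw-file URL assumes the repository serves files from a branch named "main", and the tessdata directory has to be supplied by you because its location differs between installations.

```python
import os
import urllib.request

# Assumption: raw files in the tessdata repository are served from a branch
# named "main". Adjust if the repository layout differs.
TESSDATA_RAW_BASE = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang):
    """Build the download URL for <lang>.traineddata (hypothetical helper)."""
    return f"{TESSDATA_RAW_BASE}/{lang}.traineddata"


def traineddata_path(tessdata_dir, lang):
    """Return where <lang>.traineddata should live inside tessdata_dir."""
    return os.path.join(tessdata_dir, f"{lang}.traineddata")


def install_language(lang, tessdata_dir):
    """Download <lang>.traineddata into tessdata_dir unless it already exists.

    tessdata_dir must be the directory that already contains eng.traineddata;
    writing there may require administrator rights.
    """
    target = traineddata_path(tessdata_dir, lang)
    if not os.path.exists(target):
        urllib.request.urlretrieve(traineddata_url(lang), target)
    return target
```

A call such as install_language("jpn", tessdata_dir) would then fetch the Japanese data, where tessdata_dir is whatever directory holds eng.traineddata on your machine.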

Freddy answered Jan 01 '23 10:01


This works for me (on Debian/Ubuntu, where apt-get is available):

sudo apt-get install tesseract-ocr-jpn

Hope this helps.
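
To confirm that the package actually added Japanese, you can check the output of tesseract --list-langs again. Here is a minimal Python sketch of that check. The function names are hypothetical, and the parser assumes the output consists of a header line starting with "List of available languages" followed by one language code per line. Both stdout and stderr are read because some older releases appear to print the list to stderr, so any warning text on stderr would also end up in the result.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract_cmd="tesseract"):
    """Run `tesseract --list-langs` and return the reported language codes."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_langs(result.stdout + result.stderr)
```

After the install, "jpn" in installed_langs() should be true if the language pack is visible to Tesseract.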

Harald answered Jan 01 '23 09:01


1. Install the Python wrapper: pip install pytesseract

2. On Windows, install tesseract-ocr from
https://digi.bib.uni-mannheim.de/tesseract
and select all language options while installing.

3. Set the tesseract-ocr path in anaconda/lib/site-packages/pytesseract/pytesseract.py:

tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

4. Extract the Japanese text:

from pytesseract import image_to_string
# test_file: a PIL image or the path to an image file
print(image_to_string(test_file, lang='jpn'))  # Japanese text extraction
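
As a possible alternative to editing the installed pytesseract.py in step 3, the executable path can be assigned from your own script through pytesseract.pytesseract.tesseract_cmd, which survives package upgrades. The sketch below assumes pytesseract and Pillow are installed; find_tesseract and ocr_japanese are hypothetical helper names, and the candidate paths you pass in are your own guesses about where tesseract.exe lives.

```python
import os


def find_tesseract(candidates):
    """Return the first candidate path that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def ocr_japanese(image_path, tesseract_cmd=None):
    """Run Japanese OCR on image_path and return the recognized text.

    Imports are local so this module loads even without pytesseract/Pillow.
    """
    import pytesseract
    from PIL import Image

    if tesseract_cmd:
        # Override the executable location at runtime instead of
        # editing the library source file.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang="jpn")
```

For example, ocr_japanese("sample-jpn.png", find_tesseract([r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"])) would use the path from step 3 when it exists and fall back to the default lookup otherwise.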

Amir answered Jan 01 '23 08:01