I have to analyze an image that contains both English and Japanese text. When I run tesseract with the default language (-l eng), some Japanese characters are lost. If I run it with Japanese instead (-l jpn), some English characters are lost (e.g. Email).
How can I run a single process that recognizes both English and Japanese characters?
In fact, Tesseract supports over 100 languages, including those that comprise characters and symbols, as well as right-to-left languages.
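Which of those languages are usable on a given machine depends on which language data files are installed there. The following is a minimal sketch for checking that before running OCR. It assumes the tesseract binary is on PATH and that `tesseract --list-langs` prints one header line followed by one language code per line; the helper names (`parse_list_langs`, `missing_languages`, `installed_languages`) are made up for illustration.

```python
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs`.

    Assumption: the first line is a header and every following
    non-empty line is one language code.
    """
    lines = [line.strip() for line in output.splitlines()]
    return {line for line in lines[1:] if line}


def missing_languages(requested, installed):
    """Return the requested codes that are not installed, in order."""
    return [code for code in requested if code not in installed]


def installed_languages():
    """Ask the local tesseract binary which languages it has."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: some versions may write the list to stderr instead.
    return parse_list_langs(result.stdout or result.stderr)
```

For the question above, `missing_languages(["eng", "jpn"], installed_languages())` should come back empty before `-l eng+jpn` can be expected to work.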
The Tesseract OCR engine supports multiple languages. To detect characters from a specific language, the language needs to be specified while creating the OCR engine itself. English, German, Spanish, French and Italian languages come embedded with the action so they do not require additional parameters.
Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
Since tesseract 3.02 it is possible to specify multiple languages for the -l parameter. From the manual:
-l lang
    The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes.
An example:
tesseract myscan.png out -l deu+eng
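The same command line can be assembled from a script. This is only a sketch: `build_tesseract_command` and `run_ocr` are hypothetical helper names, and it assumes the tesseract binary is on PATH and that the listed languages are installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for a multi-language tesseract run.

    The language codes are joined with '+', as the -l option expects.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run tesseract; raises CalledProcessError if the command fails."""
    subprocess.run(
        build_tesseract_command(image_path, output_base, languages),
        check=True,
    )
```

For the English and Japanese case in the question, `run_ocr("myscan.png", "out", ["eng", "jpn"])` would run `tesseract myscan.png out -l eng+jpn`.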
Try this:
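import pytesseract
from langdetect import detect_langs

# img is an image you have already loaded (e.g. with PIL or OpenCV)
custom_config = r'-l eng+jpn --psm 6'
txt = pytesseract.image_to_string(img, config=custom_config)

# Estimate which languages appear in the recognized text
print(detect_langs(txt))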
Note: you have to install langdetect by using:
pip install langdetect
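If you only want a quick sanity check that the OCR output really contains both scripts, a dependency-free sketch like the one below may be enough. It is not a replacement for langdetect: the function name `script_counts` is made up here, and it simply counts ASCII letters versus characters in the Hiragana, Katakana and CJK Unified Ideographs blocks (so it cannot tell Japanese kanji from Chinese).

```python
def script_counts(text):
    """Count ASCII letters and Japanese-looking characters in text."""
    counts = {"latin": 0, "japanese": 0}
    for ch in text:
        code = ord(ch)
        if ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif (
            0x3040 <= code <= 0x309F      # Hiragana
            or 0x30A0 <= code <= 0x30FF   # Katakana
            or 0x4E00 <= code <= 0x9FFF   # CJK Unified Ideographs
        ):
            counts["japanese"] += 1
    return counts


# Both counts should be non-zero for a mixed English/Japanese page
print(script_counts("Email: 山田太郎"))
```

If one of the two counts is zero for an image you know is mixed, that suggests the corresponding language was not passed to -l or is not installed.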