
How to detect language or script from an input image using Python or Tesseract OCR?

Given an input image which can be in any language or writing system, how do I detect what script the text in the picture uses?

Any Python-based or Tesseract-OCR based solution would be appreciated.


Note that script here means writing systems like Latin, Cyrillic, Devanagari, etc., for corresponding languages like English, Russian, Hindi, etc. (respectively)

Gokul NC asked Nov 19 '25 20:11



1 Answer

Pre-requisites:

  • Install Tesseract and all of its language/script data: sudo apt install tesseract-ocr tesseract-ocr-all
  • Install pytesseract: pip install pytesseract

Script-Detection:

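import re

import pytesseract

def detect_image_lang(img_path):
    """Return (script_name, confidence) for the image, or (None, 0.0) on failure."""
    try:
        # OSD = orientation and script detection; the result is a plain-text report.
        osd = pytesseract.image_to_osd(img_path)
        script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
        conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
        return script, float(conf)
    except Exception:
        # Tesseract failed (e.g. too little text) or the report had an unexpected format.
        return None, 0.0

script_name, confidence = detect_image_lang("image.png")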
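
The OSD report is a list of "Key: value" lines, so it can also be parsed without regular expressions. The sketch below is one possible way to do that and to discard low-confidence guesses. The helper names parse_osd and script_if_confident and the min_confidence cutoff of 1.0 are illustrative assumptions, not part of pytesseract, and the sample values are made up.

```python
def parse_osd(osd_text):
    """Turn a 'Key: value' OSD report into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def script_if_confident(osd_text, min_confidence=1.0):
    """Return the script name, or None when the confidence is below the cutoff."""
    fields = parse_osd(osd_text)
    try:
        confidence = float(fields["Script confidence"])
    except (KeyError, ValueError):
        return None
    # min_confidence is a hypothetical threshold; tune it on your own images.
    return fields.get("Script") if confidence >= min_confidence else None


# Sample report in the shape image_to_osd returns (values are made up).
sample = (
    "Page number: 0\n"
    "Orientation in degrees: 0\n"
    "Rotate: 0\n"
    "Orientation confidence: 15.33\n"
    "Script: Latin\n"
    "Script confidence: 2.00\n"
)
print(script_if_confident(sample))
```

In practice the string passed in would be the return value of pytesseract.image_to_osd(img_path) rather than the hard-coded sample.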

Language-Detection:

After performing OCR with Tesseract, pass the recognized text through the langdetect library (or any other language-identification library).
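
A minimal sketch of that two-step approach, assuming langdetect is installed (pip install langdetect). The function names, the ocr_lang default of "eng", and the 20-character minimum are assumptions made for illustration; in practice ocr_lang would probably be chosen to match the detected script (for example "rus" for Cyrillic or "hin" for Devanagari).

```python
def normalize_ocr_text(text, min_chars=20):
    """Collapse whitespace; return None if too little text is left to classify."""
    cleaned = " ".join(text.split())
    # min_chars is a hypothetical cutoff: very short strings tend to give unreliable guesses.
    return cleaned if len(cleaned) >= min_chars else None


def detect_image_language(img_path, ocr_lang="eng"):
    """OCR the image, then guess the language of the recognized text."""
    # Imported here so the helper above can be used without the OCR stack installed.
    import pytesseract
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is randomized; a fixed seed makes results repeatable
    text = normalize_ocr_text(pytesseract.image_to_string(img_path, lang=ocr_lang))
    if text is None:
        return None
    try:
        return detect(text)  # language code such as "en" or "ru"
    except LangDetectException:
        # Raised when the text has no usable features (e.g. only digits or symbols).
        return None
```

Calling detect_image_language("image.png", ocr_lang="rus") would then return a language code, or None when OCR produced too little text to classify.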

Gokul NC answered Nov 23 '25 17:11
