How to detect unicode character range in Python?

Tags:

python

unicode

I have a corpus of text documents that contain text in more than one language.

For each line I read, I have to find out which language it is written in. This is limited to three languages: English, Hindi (U+0900–U+097F), and Telugu (U+0C00–U+0C7F).

How can I make my program filter the lines by script?

Aditya asked Jan 27 '26 02:01


1 Answer

Use max() to pick out the highest codepoint used, then match that against your ranges:

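def detect_language(line):
    # Highest codepoint in the line (raises ValueError if the line is empty)
    maxchar = max(line)
    # Telugu block: U+0C00 to U+0C7F
    if u'\u0c00' <= maxchar <= u'\u0c7f':
        return 'telugu'
    # Devanagari block (Hindi): U+0900 to U+097F
    elif u'\u0900' <= maxchar <= u'\u097f':
        return 'hindi'
    # Anything outside both ranges is treated as English
    return 'english'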

Demo:

>>> detect_language(u'Hello world!')
'english'
>>> detect_language(u'తెలుగు')
'telugu'
>>> detect_language(u'हिन्दी')
'hindi'
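
If lines can be empty, or can contain stray characters above U+0C7F (curly quotes, for example), the single highest codepoint may not be representative. One possible variation is to count characters per range and return the most common one. This is only a sketch for Python 3: the names `SCRIPT_RANGES` and `detect_language_by_count` are made up for illustration, and treating ASCII letters as English and falling back to 'english' when nothing matches are assumptions rather than requirements from the question.

```python
from collections import Counter

# Codepoint ranges: Hindi and Telugu come from the question;
# the ASCII letter ranges for English are an assumption.
SCRIPT_RANGES = [
    ('english', 0x0041, 0x005A),  # A-Z
    ('english', 0x0061, 0x007A),  # a-z
    ('hindi', 0x0900, 0x097F),    # Devanagari block
    ('telugu', 0x0C00, 0x0C7F),   # Telugu block
]


def detect_language_by_count(line, default='english'):
    """Return the label whose ranges cover the most characters in line."""
    counts = Counter()
    for ch in line:
        cp = ord(ch)
        for name, low, high in SCRIPT_RANGES:
            if low <= cp <= high:
                counts[name] += 1
                break
    if not counts:
        # Empty line, or only digits, punctuation, and whitespace
        return default
    return counts.most_common(1)[0][0]


print(detect_language_by_count('हिन्दी “ok”'))
```

Here the line is reported as 'hindi' because Devanagari characters outnumber the Latin letters, whereas a max()-based check would see the closing curly quote (U+201D) as the highest codepoint and fall through to 'english'. When two scripts tie, the result depends on which was seen first, so heavily mixed lines may need a separate rule.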
Martijn Pieters answered Jan 29 '26 16:01