
Find out the unicode script of a character

Tags: python, unicode

Given a Unicode character, what would be the simplest way to return its script (as "Latin", "Hangul", etc.)? unicodedata doesn't seem to provide this kind of feature.

asked Mar 26 '12 08:03 by georg



2 Answers

I was hoping someone had done this before, but apparently not, so here's what I ended up with. The module below (I call it unicodedata2) extends unicodedata and provides script_cat(chr), which returns a tuple (script name, category) for a Unicode character. Example:

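# coding=utf8
# unicodedata2 is the module from the gist linked below, saved locally
import unicodedata2

# print() with a single argument behaves the same on Python 2 and Python 3
print(unicodedata2.script_cat(u'Ф'))  # ('Cyrillic', 'L')
print(unicodedata2.script_cat(u'の'))  # ('Hiragana', 'Lo')
print(unicodedata2.script_cat(u'★'))  # ('Common', 'So')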

The module: https://gist.github.com/2204527

answered Oct 08 '22 06:10 by georg


It seems to me that the Python unicodedata module contains tools for accessing the main file in the Unicode database but nothing for the other files: “The data in this database is based on the UnicodeData.txt file”

The script information is in the Scripts.txt file. Its format is relatively simple (described in UAX #44) and the file is not horribly large (131 kilobytes), so you might consider parsing it in your program. Note that the Unicode classification includes a “Common” script for characters used in different scripts, such as punctuation marks.
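
To make the parsing idea concrete, here is a minimal sketch in Python 3. It is not the code from the gist above, and the names parse_scripts, script_of, SAMPLE and RANGES are made up for illustration. It assumes the line layout "START[..END] ; Script # Category ..." used by Scripts.txt, and it runs on a small hand-abridged sample rather than the real file, so the ★ entry in particular is illustrative (the real file lists that character inside a larger Common range). Code points not listed are reported as "Unknown", which is the default the file itself declares.

```python
import bisect
import re

# Hand-abridged sample in the Scripts.txt layout (illustrative, NOT the full file).
SAMPLE = """\
# Abridged sample; see the Unicode Character Database for the real Scripts.txt.
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0400..0481    ; Cyrillic # L& [130] CYRILLIC CAPITAL LETTER IE WITH GRAVE..CYRILLIC SMALL LETTER KOPPA
2605          ; Common # So       BLACK STAR
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
AC00..D7A3    ; Hangul # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH
"""

# One data line: hex start, optional "..end", script name, then "# category".
LINE_RE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)\s*#\s*(\S+)")


def parse_scripts(text):
    """Return a sorted list of (start, end, script, category) tuples."""
    ranges = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # comment or blank line
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)  # single code point
        ranges.append((start, end, match.group(3), match.group(4)))
    ranges.sort()
    return ranges


def script_of(char, ranges):
    """Look up the script of one character by binary search over range starts."""
    code_point = ord(char)
    starts = [r[0] for r in ranges]
    index = bisect.bisect_right(starts, code_point) - 1
    if index >= 0 and ranges[index][0] <= code_point <= ranges[index][1]:
        return ranges[index][2]
    return "Unknown"  # not covered by the parsed data


RANGES = parse_scripts(SAMPLE)

for ch in "AФの한★":
    print(ch, script_of(ch, RANGES))
```

To use real data, the same parse_scripts function could be fed the contents of a locally downloaded Scripts.txt, for example parse_scripts(open("Scripts.txt", encoding="utf-8").read()). Rebuilding the list of range starts on every lookup keeps the sketch short; for many lookups it would be worth computing it once.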

answered Oct 08 '22 08:10 by Jukka K. Korpela