
How can I determine a Unicode character from its name in Python, even if that character is a control character?

Tags: python, unicode

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:

whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]

That's a little obscure; names would be better. The unicodedata.lookup function, passed through ord, helps somewhat:

>>> import unicodedata
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160

But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?

Jeff Walden asked Jul 05 '11 22:07



2 Answers

Kerrek SB's comment is a good one: just put the names in a comment.
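A minimal sketch of that suggestion, using the code points from the question. The names in the comments are the informal ones used in the question, not necessarily the formal Unicode names:

```python
# Code points treated as white space in JavaScript (per the question),
# with the human-readable names kept alongside as comments.
whitespace = [
    0x9,     # HORIZONTAL TAB
    0xb,     # VERTICAL TAB
    0xc,     # FORM FEED
    0x20,    # SPACE
    0xa0,    # NO-BREAK SPACE
    0xfeff,  # BYTE ORDER MARK
]
```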

BTW, Python also supports named Unicode escapes in string literals:

>>> u"\N{NO-BREAK SPACE}"
u'\xa0'

But it uses the same Unicode name database, and the control characters are not in it.

Ned Batchelder answered Oct 21 '22 23:10



You could roll your own "database" for the control characters by parsing a few lines of the Unicode Character Database (UCD) files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
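Here is a rough sketch of what that parsing might look like. It assumes the semicolon-separated layout of UnicodeData.txt, where field 1 is `<control>` for control characters and field 10 carries the older Unicode 1.0 name. The embedded sample lines are an illustrative excerpt written in that layout, and `control_names` is a hypothetical helper, so check both against the real file before relying on them:

```python
# Illustrative excerpt in the UnicodeData.txt layout (assumed, not downloaded).
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
"""


def control_names(text):
    """Map old-style names of control characters to their code points."""
    names = {}
    for line in text.splitlines():
        fields = line.split(";")
        # Skip short lines and anything that is not a control character.
        if len(fields) < 11 or fields[1] != "<control>":
            continue
        # Drop a trailing abbreviation such as " (FF)" from the old name.
        old_name = fields[10].split(" (")[0].strip()
        if old_name:
            names[old_name] = int(fields[0], 16)
    return names


CONTROL_NAMES = control_names(SAMPLE)
```

With the sample above, `CONTROL_NAMES["FORM FEED"]` gives 12 (0xc). Note that the names come out as the file spells them (for example LINE TABULATION rather than VERTICAL TAB), so the lookup keys have to match those spellings.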

ars answered Oct 21 '22 23:10
