How can I determine a Unicode character from its name in Python, even if that character is a control character?

Question

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:

whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]

That's a little obscure; names would be better. Passing the result of the unicodedata.lookup function to ord helps somewhat:

>>> import unicodedata
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160

But this doesn't work for 0x9, 0xb, or 0xc. I think that's because they're control characters, and "names" such as FORM FEED are only aliases. Is there any way to map these "names" to the characters, or to their code points, in standard Python? Or am I out of luck?

Ned Batchelder · Accepted Answer

Kerrek SB's comment is a good one: just put the names in a comment.
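A minimal sketch of that suggestion, using the code points from the question. The names in the comments are the ones I believe the Unicode standard uses for these characters; treat them as annotations, not as something Python checks.

```python
# Code points that JavaScript treats as white space (excluding the
# general Unicode space-separator category, handled elsewhere).
whitespace = [
    0x0009,  # CHARACTER TABULATION (horizontal tab)
    0x000B,  # LINE TABULATION (vertical tab)
    0x000C,  # FORM FEED
    0x0020,  # SPACE
    0x00A0,  # NO-BREAK SPACE
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
]
```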

BTW, Python also supports named Unicode escapes in string literals:

>>> u"\N{NO-BREAK SPACE}"
u'\xa0'

But it uses the same unicode name database, and the control characters are not in it.
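That limitation applies to the Python versions current when this was written. As far as I know, Python 3.3 and later also resolve formal name aliases in both unicodedata.lookup and \N{...}, so a sketch like the following should work there. It assumes Python 3.3+ and will not work on Python 2.

```python
import unicodedata

# Assumption: Python 3.3 or later, where name aliases such as
# FORM FEED are resolved in addition to primary character names.
names = [
    "CHARACTER TABULATION",       # alias for U+0009
    "LINE TABULATION",            # alias for U+000B
    "FORM FEED",                  # alias for U+000C
    "SPACE",
    "NO-BREAK SPACE",
    "ZERO WIDTH NO-BREAK SPACE",  # U+FEFF, also used as the BOM
]

whitespace = [ord(unicodedata.lookup(name)) for name in names]

# The same aliases are accepted in named escapes.
form_feed = "\N{FORM FEED}"
```

Note that the reverse direction is different: unicodedata.name still has no name to return for control characters, so it raises ValueError unless a default is supplied.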

ars · Answer

You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
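A rough sketch of what that parsing could look like. The sample lines below are written in the semicolon-separated UnicodeData.txt layout and are meant to be illustrative; check them against the real file. The function name parse_control_names is made up for this example. Control characters carry the placeholder name <control> in the second field, and the older Unicode 1.0 name sits in the eleventh field.

```python
# Illustrative lines in the UnicodeData.txt format (15 fields each).
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
"""


def parse_control_names(text):
    """Map old-style names of control characters to their code points."""
    names = {}
    for line in text.splitlines():
        fields = line.split(";")
        # Keep only control characters that have an entry in field 10.
        if len(fields) < 11 or fields[1] != "<control>" or not fields[10]:
            continue
        # Drop a trailing parenthetical abbreviation such as " (FF)".
        name = fields[10].split(" (")[0]
        names[name] = int(fields[0], 16)
    return names


control_names = parse_control_names(SAMPLE)
```

With the sample above, control_names["FORM FEED"] is 12, and SPACE is skipped because it is not a control character.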

Tags: python, unicode

Jeff Walden
