I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little obscure; names would be better. The unicodedata.lookup function, passed through ord, helps somewhat:
>>> import unicodedata
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?
Kerrek SB's comment is a good one: just put the names in a comment.
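A minimal sketch of what that could look like, using the six code points from the question. The comment wording is just one possible choice of names:

```python
# JavaScript white space code points, excluding the Unicode
# "space separator" category, which is handled separately.
whitespace = [
    0x9,     # CHARACTER TABULATION (horizontal tab)
    0xb,     # LINE TABULATION (vertical tab)
    0xc,     # FORM FEED
    0x20,    # SPACE
    0xa0,    # NO-BREAK SPACE
    0xfeff,  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
]
```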
BTW, Python also supports named Unicode escapes in string literals:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.
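That said, this depends on the Python version. The unicodedata documentation notes that lookup gained support for name aliases in Python 3.3, so on a newer interpreter the control-character aliases may resolve directly. A hedged sketch, where code_point is a hypothetical helper (not part of unicodedata) that returns None when a name is not known to the interpreter's database:

```python
import unicodedata

# Names and aliases for the six code points in the question.
# The first three are control-character aliases, which only
# resolve on interpreters whose database includes name aliases.
NAMES = [
    "CHARACTER TABULATION",
    "LINE TABULATION",
    "FORM FEED",
    "SPACE",
    "NO-BREAK SPACE",
    "ZERO WIDTH NO-BREAK SPACE",
]

def code_point(name):
    """Return the code point for a Unicode name or alias, or None if unknown."""
    try:
        return ord(unicodedata.lookup(name))
    except KeyError:
        return None

whitespace = [code_point(name) for name in NAMES]
```

On Python 3.3 or later this should yield the same six values as the magic-number list; on older versions the first three entries come back as None.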
You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
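A rough sketch of that parsing, assuming the usual semicolon-separated UnicodeData layout in which control characters carry the placeholder name <control> in the second field and their older name in the eleventh. The SAMPLE lines below are written from memory of that layout and should be checked against the real file. control_names is a hypothetical helper:

```python
# Sample lines in UnicodeData format (an assumption; verify against the real file).
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
"""

def control_names(lines):
    """Map the old-style name of each control character to its code point."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        # Skip short lines and anything that is not a control character.
        if len(fields) < 11 or fields[1] != "<control>":
            continue
        name = fields[10]  # eleventh field: the older (Unicode 1.0) name
        if name:
            table[name] = int(fields[0], 16)
    return table

CONTROL = control_names(SAMPLE.splitlines())
```

With the real file, the same function could be fed an open file object instead of SAMPLE.splitlines().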