It appears, based on a urwid example, that u'\N{HYPHEN BULLET}' will create a Unicode character that is a hyphen intended for a bullet.
The names for Unicode characters seem to be defined at fileformat.info, and some aspects of using Unicode in Python appear in the howto documentation, though there is no mention of the \N{} syntax.
If you pull all these docs together, you get the idea that the constant u"\N{HYPHEN BULLET}" creates a ⁃ character.
However, this is all a theory based on pulling all this data together. I can find no documentation for "\N{}" in the Python docs.
My question is whether my theory of operation is correct, and whether it is documented anywhere.
Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F (see hex chart).
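As a quick check of the U+ notation, here is a minimal sketch, assuming Python 3 (the variable names are only illustrative). It prints the code points of "Hello" in that notation and the largest code point the interpreter supports:

```python
import sys

word = "Hello"

# ord() gives the code point as an integer; format it as U+XXXX in hex.
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(" ".join(code_points))

# The highest code point is U+10FFFF, so there are 0x110000 (1,114,112) in total.
print(f"U+{sys.maxunicode:X}")
```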
Unicode names are officially in uppercase and in English, but are not case sensitive. The names may only use the letters A to Z, the digits 0 to 9, space, and hyphen.
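The case-insensitivity can be tried out directly. This is a small sketch assuming CPython 3, where both the \N{} escape and unicodedata.lookup() appear to accept lowercase names; other implementations may behave differently:

```python
import unicodedata

# Same character, name spelled in different cases.
upper = "\N{HYPHEN BULLET}"
lower = "\N{hyphen bullet}"
looked_up = unicodedata.lookup("Hyphen Bullet")

print(upper == lower == looked_up)

# The official name comes back in uppercase.
print(unicodedata.name(lower))
```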
Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.
Not every gory detail can be found in a how-to. The table of escape sequences in the reference manual includes:
Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database (Unicode only)
You are correct that u"\N{CHARACTER NAME}" produces a valid Unicode character in Python.
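One way to confirm this for the character in the question is the following minimal sketch, assuming Python 3 (where the u prefix is optional). It compares the named escape with the numeric escape for U+2043 and asks unicodedata for the name back:

```python
import unicodedata

bullet = "\N{HYPHEN BULLET}"

# The named escape and the numeric escape should give the same one-character string.
print(bullet == "\u2043")

# Show the character, its code point, and its name.
print(bullet, f"U+{ord(bullet):04X}", unicodedata.name(bullet))
```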
It is not documented much in the Python docs, but after some searching I found a reference to it on effbot.org:
http://effbot.org/librarybook/ucnhash.htm
The ucnhash module
(Implementation, 2.0 only) This module is an implementation module, which provides a name to character code mapping for Unicode string literals. If this module is present, you can use \N{} escapes to map Unicode character names to codes.
In Python 2.1, the functionality of this module was moved to the unicodedata module.
Checking the documentation for unicodedata shows that the module uses the data from the Unicode Character Database:
unicodedata — Unicode Database
This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 9.0.0.
The full data can be found at: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt
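The module can also be queried at runtime. The sketch below assumes Python 3 and shows unicodedata.lookup() as the runtime counterpart of the \N{} escape. The UCD version printed depends on the interpreter build, so it may not be 9.0.0:

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# lookup() maps a name to a character, like \N{...} does in a literal.
ch = unicodedata.lookup("LATIN CAPITAL LETTER A")
print(repr(ch), f"U+{ord(ch):04X}", unicodedata.category(ch))

# Unknown names raise KeyError.
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
except KeyError as exc:
    print("lookup failed:", exc)
```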
Each line of UnicodeData.txt has the structure HEXVALUE;CHARACTER NAME;etc., so you could use this data to look up characters.
For example:
# 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
>>> u"\N{LATIN CAPITAL LETTER A}"
'A'
# FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
>>> u"\N{HALFWIDTH KATAKANA LETTER SA}"
'ｻ'
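To illustrate looking characters up from the data file itself, here is a rough sketch, assuming Python 3. It parses the two records quoted above rather than downloading the full file, and parse_ucd_line is a hypothetical helper, not part of any library:

```python
import unicodedata

# Two records copied from UnicodeData.txt (semicolon-separated fields).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
"""

def parse_ucd_line(line):
    """Return (name, character) from one record: field 0 is hex, field 1 is the name."""
    fields = line.split(";")
    return fields[1], chr(int(fields[0], 16))

name_to_char = dict(parse_ucd_line(line) for line in SAMPLE.splitlines())

for name, char in name_to_char.items():
    # Cross-check the parsed record against Python's own database.
    matches = unicodedata.lookup(name) == char
    print(f"U+{ord(char):04X} {name} -> {char} (matches unicodedata: {matches})")
```

In practice unicodedata.lookup() or the \N{} escape already does this lookup, so parsing the file by hand is mainly useful for browsing the available names.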