
Details of Unicode Names \N Documented? [duplicate]

It appears, based on a urwid example, that u'\N{HYPHEN BULLET}' creates a Unicode character: a hyphen intended for use as a bullet.

The names of Unicode characters seem to be listed at fileformat.info, and the howto documentation covers some aspects of using Unicode in Python, though it makes no mention of the \N{} syntax.

Pulling these sources together suggests that the literal u"\N{HYPHEN BULLET}" creates a ⁃ character.

However, this is only a theory pieced together from those sources. I can find no documentation for "\N{}" in the Python docs.

My question: is my theory of operation correct, and is it documented anywhere?

Ray Salemi asked Nov 29 '17 14:11


2 Answers

Not every gory detail can be found in a how-to. The table of escape sequences in the reference manual includes:

Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database (Unicode only)
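
A quick way to confirm this in an interpreter is sketched below. It assumes Python 3, where every str literal is Unicode and the u prefix is optional; the variable names are purely illustrative.

```python
import unicodedata

# \N{name} in a string literal is resolved when the source is compiled.
bullet = "\N{HYPHEN BULLET}"

# The same character written with its code point, U+2043.
same_bullet = "\u2043"

# unicodedata.lookup() maps a name to a character at run time,
# and unicodedata.name() maps a character back to its name.
looked_up = unicodedata.lookup("HYPHEN BULLET")
name = unicodedata.name(bullet)

print(bullet == same_bullet == looked_up)
print(name)
```

If the escape, the code point, and the lookup all agree, the name in the literal is being resolved against the same database that unicodedata exposes.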

Stop harming Monica answered Oct 12 '22 11:10



You are correct that u"\N{CHARACTER NAME}" produces a valid Unicode character in Python.

It is not documented prominently in the Python docs, but after some searching I found a reference to it on effbot.org:

http://effbot.org/librarybook/ucnhash.htm

The ucnhash module

(Implementation, 2.0 only) This module is an implementation module, which provides a name to character code mapping for Unicode string literals. If this module is present, you can use \N{} escapes to map Unicode character names to codes.

In Python 2.1, the functionality of this module was moved to the unicodedata module.

The documentation for unicodedata shows that the module uses data from the Unicode Character Database.

unicodedata — Unicode Database

This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 9.0.0.

The full data can be found at: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt

Each record has the structure HEXVALUE;CHARACTER NAME;... so you can use this file to look up character names.

For example:

# 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
>>> u"\N{LATIN CAPITAL LETTER A}"
'A'

# FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
>>> u"\N{HALFWIDTH KATAKANA LETTER SA}"
'サ'
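
The record layout above can also be checked programmatically. The following is a minimal sketch, assuming Python 3; parse_ucd_line is a hypothetical helper written for this illustration, not part of unicodedata, and it only reads the first two fields of a record.

```python
import unicodedata


def parse_ucd_line(line):
    """Split one UnicodeData.txt record into (character, name)."""
    fields = line.strip().split(";")
    code_point = int(fields[0], 16)    # first field: hex code point
    return chr(code_point), fields[1]  # second field: character name


record = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
char, name = parse_ucd_line(record)

# The name taken from the file should resolve to the same character.
matches = unicodedata.lookup(name) == char
print(char, name, matches)
```

Note that some records in the file carry placeholder names in angle brackets (for example <control>), which are not usable in \N{} or unicodedata.lookup(), so a real lookup tool would need to skip or special-case them.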
alxwrd answered Oct 12 '22 09:10
