
Why doesn't unicodedata recognise certain characters?

Tags: python, unicode

In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.

>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'

Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".

NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.

Hammerite asked Jul 03 '14 11:07


1 Answer

The unicodedata.name() lookup relies on field 1 (counting from 0) of each record in the UnicodeData.txt database from the standard (Python 2.7 uses Unicode 5.2.0).

If that name starts with <, it is ignored. All control codes, including the newline, fall into that category; their name field holds nothing other than <control>:

000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

Field 10 (counting from 0) is the old Unicode 1.0 name and should not be used, according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
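To make the field layout concrete, here is a minimal Python 3 sketch that splits the record quoted above on semicolons. The variable names are only illustrative, and the "<no name>" fallback string is an arbitrary choice passed as the optional default argument of unicodedata.name():

```python
import unicodedata

# The UnicodeData.txt record for U+000A, copied from the answer above.
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"
fields = record.split(";")

code_point = int(fields[0], 16)   # 0x000A, i.e. 10
name_field = fields[1]            # the name field: "<control>"
old_name_field = fields[10]       # the Unicode 1.0 name: "LINE FEED (LF)"

# name() raises ValueError for a character without a name,
# unless a default is supplied as the second argument.
fallback = unicodedata.name(chr(code_point), "<no name>")

print(name_field, "|", old_name_field, "|", fallback)
```

Passing a default is usually the simplest way to avoid the ValueError from the question when a string may contain control characters.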

Python 3.3 added support for NameAliases.txt, which lets you look up characters by alias, so lookup('LINE FEED'), lookup('new line'), lookup('eol') and so on all return \n. However, unicodedata.name() does not support aliases, nor could it (which of the aliases would it pick?). From the Python 3.3 release notes:

  • Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
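As a rough illustration of that asymmetry, the following sketch (assuming Python 3.3 or later) resolves the alias in both supported ways and then shows that the reverse lookup still fails. The has_name flag is just a hypothetical helper variable for the demonstration:

```python
import unicodedata

# Both lookup() and the \N{...} escape resolve the alias (Python 3.3+).
lf = unicodedata.lookup("LINE FEED")
same = "\N{LINE FEED}"

# The reverse direction still has no answer: name() ignores aliases.
try:
    unicodedata.name(lf)
    has_name = True
except ValueError:
    has_name = False

print(repr(lf), repr(same), has_name)
```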

TL;DR: LINE FEED is not the official name for \n; it is only an alias. Python 3.3 and up let you look up characters by alias.

Martijn Pieters answered Oct 06 '22 14:10