
Python unicode normalization: is it correct to translate u'\xb4' to u' \u0301'

Tags: python, unicode

Look at the following snippet:

>>> import unicodedata
>>> from unicodedata import normalize, name

>>> normalize('NFKD', u'\xb4')
u' \u0301'

>>> normalize('NFKD', u'a\xb4a')
u'a \u0301a'

>>> normalize('NFKC', u'a\xb4a')
u'a \u0301a'

>>> name(u'\xb4'), name(u'\u0301')
('ACUTE ACCENT', 'COMBINING ACUTE ACCENT')

I am trying to understand whether translating u'\xb4' to u' \u0301' is correct behavior. Why does it pad the combining acute accent with a space? Why does it translate u'\xb4' at all?

On fileformat.info we see that ACUTE ACCENT used to be called SPACING ACUTE. I thought that just meant the cursor should advance instead of waiting for the following character to be typed.

UPD: in case someone is interested, here is a list of Unicode characters that begin with a space after NFKC normalization: http://pastebin.com/Z99r5AK9
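A list like the one linked above can be regenerated locally. The following is a minimal sketch in Python 3 (the snippets above are Python 2); the function name `leading_space_after_nfkc` is made up for illustration, and the choice to skip surrogates and to ignore characters that normalize to a bare space is an assumption, not necessarily how the pastebin list was built. The exact contents depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata

def leading_space_after_nfkc():
    """Collect characters whose NFKC form is a space followed by something else."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        norm = unicodedata.normalize('NFKC', ch)
        # len > 1 excludes characters that fold to a plain space (e.g. NO-BREAK SPACE)
        if len(norm) > 1 and norm.startswith(' '):
            found.append(ch)
    return found

chars = leading_space_after_nfkc()
for ch in chars[:5]:
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>')))
```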

newtover asked Dec 19 '12 14:12



2 Answers

The spacing acute accent U+00B4 has a compatibility decomposition into a space followed by the combining acute accent, as specified in the Unicode character database:

>>> import unicodedata
>>> unicodedata.decomposition(u'\xb4')
'<compat> 0020 0301'

The \u00B4 character has a somewhat ambiguous history, but the Unicode standard has decided to treat it as a space plus a combining accent, even though it has often been used as just a diacritic mark; see this discussion.

You could perhaps use \u02CA (MODIFIER LETTER ACUTE ACCENT) as an alternative; it is not treated as whitespace and has no decomposition specified. It is instead classified as a modifier letter (general category Lm), so your mileage may vary.
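The difference between the two characters can be checked directly with `unicodedata`. This is a small Python 3 sketch; the variable names `spacing` and `modifier` are only illustrative.

```python
import unicodedata

spacing = '\u00b4'   # ACUTE ACCENT
modifier = '\u02ca'  # MODIFIER LETTER ACUTE ACCENT

for ch in (spacing, modifier):
    print(
        unicodedata.name(ch),
        unicodedata.category(ch),                # general category
        repr(unicodedata.decomposition(ch)),     # '' means no decomposition
        repr(unicodedata.normalize('NFKD', ch)), # compatibility-decomposed form
    )
```

Only the first character changes under NFKD; the second passes through untouched, which is why it may work as a replacement if a letter-like character is acceptable in your data.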

Martijn Pieters answered Oct 17 '22 08:10



Take a look at the Unicode Collation Algorithm document. In particular, note that

Compatibility normalization (NFKC) folds stand-alone accents to a combination of space + combining accent.
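The same folding applies to other stand-alone accents, not just U+00B4. As a rough illustration in Python 3, the sketch below runs NFKC over a hand-picked sample of Latin-1 spacing accents (the list `standalone` is my own selection, not taken from the UCA document) and prints what each one folds to.

```python
import unicodedata

# Hand-picked sample: DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA
standalone = ['\u00a8', '\u00af', '\u00b4', '\u00b8']

for ch in standalone:
    folded = unicodedata.normalize('NFKC', ch)
    # Spell out each code point of the folded form by name
    parts = ' + '.join(unicodedata.name(c) for c in folded)
    print('%s -> %s' % (unicodedata.name(ch), parts))
```

Each of these should come out as SPACE plus the corresponding COMBINING mark. ASCII look-alikes such as the grave accent (U+0060) have no decomposition and are left unchanged.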

borrible answered Oct 17 '22 07:10
