Look at the following snippet:
>>> import unicodedata
>>> from unicodedata import normalize, name
>>> normalize('NFKD', u'\xb4')
u' \u0301'
>>> normalize('NFKD', u'a\xb4a')
u'a \u0301a'
>>> normalize('NFKC', u'a\xb4a')
u'a \u0301a'
>>> name(u'\xb4'), name(u'\u0301')
('ACUTE ACCENT', 'COMBINING ACUTE ACCENT')
I am trying to understand whether translating u'\xb4' to u' \u0301' is correct. Why does it pad the combining acute accent with a space? Why does it translate u'\xb4' at all?
At fileformat we see that ACUTE ACCENT used to be called SPACING ACUTE. I thought that just meant the cursor should advance instead of waiting for the following character to be typed.
UPD: in case someone is interested, here is a list of Unicode characters that have a leading space after NFKC normalization: http://pastebin.com/Z99r5AK9
A stand-alone (spacing) accent character has a compatibility decomposition into a space followed by the corresponding combining accent, as specified in the Unicode standard:
>>> import unicodedata
>>> unicodedata.decomposition(u'\xb4')
'<compat> 0020 0301'
The \u00B4 character has a somewhat ambiguous history, but the Unicode standard has decided to treat it as whitespace + accent, even though it has often been used as just a diacritic mark; see this discussion.
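To see that the space only appears under the compatibility forms, here is a minimal sketch in Python 3 syntax (the question uses Python 2). The helper name `normal_forms` is made up for illustration; only `unicodedata.normalize` comes from the standard library. Because the decomposition is tagged `<compat>`, the canonical forms NFC and NFD should leave the character alone.

```python
import unicodedata

ACUTE = "\u00b4"  # ACUTE ACCENT, the character from the question


def normal_forms(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


forms = normal_forms(ACUTE)
for form, result in forms.items():
    # Show code points rather than glyphs, since a combining mark is hard to read.
    print(form, [f"U+{ord(ch):04X}" for ch in result])
```

So if you need normalization but want to keep U+00B4 intact, NFC or NFD may be enough, provided you do not depend on the other compatibility foldings.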
You could perhaps use \u02CA as an alternative; it is not treated as whitespace, and has no decomposition specified. It is instead qualified as a letter, so your mileage may vary.
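A quick way to compare the two characters is to look at their name, general category and decomposition. This is only a sketch in Python 3 syntax; `describe` is a hypothetical helper, and the exact values depend on the Unicode database bundled with your Python build.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, decomposition) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


spacing = describe("\u00b4")   # ACUTE ACCENT
modifier = describe("\u02ca")  # the suggested alternative

print(spacing)
print(modifier)

# With no decomposition mapping, NFKC should leave U+02CA unchanged.
unchanged = unicodedata.normalize("NFKC", "\u02ca") == "\u02ca"
print(unchanged)
```

The category matters in practice: a modifier letter (`Lm`) counts as part of a word for things like `str.isalpha()` or regex `\w`, whereas U+00B4 is a modifier symbol (`Sk`).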
Take a look at the Unicode Collation Algorithm document. In particular, note that
Compatibility normalization (NFKC) folds stand-alone accents to a combination of space + combining accent.
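The folding described above can be checked by brute force. The following sketch (Python 3 syntax; the function name `leading_space_after_nfkc` is invented here) scans every code point and collects those whose NFKC form starts with a space, which is roughly how a list like the one linked in the question could be produced. The exact result depends on the Unicode version of your Python build.

```python
import sys
import unicodedata


def leading_space_after_nfkc():
    """Collect characters (other than U+0020) whose NFKC form starts with a space."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Skip the plain space itself and lone surrogates.
        if ch == " " or unicodedata.category(ch) == "Cs":
            continue
        if unicodedata.normalize("NFKC", ch).startswith(" "):
            found.append(ch)
    return found


found = leading_space_after_nfkc()
for ch in found:
    # Unassigned code points never get here, but give name() a default anyway.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

Note that the result mixes two kinds of characters: stand-alone accents such as U+00B4 and U+00A8, and space variants such as U+00A0 that simply fold to U+0020.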