
Why is length of "Níðhöggr" 9?

Why does the length function say that this 8-character string has 9 characters?

>>> length "Níðhöggr"
9

Dog asked May 27 '13 19:05


2 Answers

"Níðhöggr" contains 9 Unicode code points:

U+004E N (Lu): LATIN CAPITAL LETTER N 
U+00ED í (Ll): LATIN SMALL LETTER I WITH ACUTE
U+00F0 ð (Ll): LATIN SMALL LETTER ETH 
U+0068 h (Ll): LATIN SMALL LETTER H 
U+006F o (Ll): LATIN SMALL LETTER O 
U+0308 ̈ (Mn): COMBINING DIAERESIS 
U+0067 g (Ll): LATIN SMALL LETTER G 
U+0067 g (Ll): LATIN SMALL LETTER G 
U+0072 r (Ll): LATIN SMALL LETTER R 
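
A listing like the one above can be reproduced from Haskell itself. This is only a minimal sketch using `base`; the names `nidhoggr`, `describe` and `codePoints` are made up for illustration, and the string is written with escapes so the combining mark is unambiguous. It shows GHC's constructor names for the general category (for example `NonSpacingMark`) rather than the two-letter abbreviations.

```haskell
import Data.Char (generalCategory, ord, toUpper)
import Numeric (showHex)

-- | The string from the question, spelled with escapes:
-- o (U+006F) followed by COMBINING DIAERESIS (U+0308).
nidhoggr :: String
nidhoggr = "N\xED\xF0ho\x308ggr"

-- | Render one code point as "U+XXXX (GeneralCategory)".
describe :: Char -> String
describe c = "U+" ++ pad hex ++ " (" ++ show (generalCategory c) ++ ")"
  where
    hex = map toUpper (showHex (ord c) "")
    -- Left-pad with zeros to at least four hex digits.
    pad s = replicate (4 - length s) '0' ++ s

-- | One description per code point in the string.
codePoints :: String -> [String]
codePoints = map describe
```

In GHCi, `mapM_ putStrLn (codePoints nidhoggr)` prints nine lines, and the sixth one is `U+0308 (NonSpacingMark)`.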

You might want to use "Níðhöggr", which looks the same when printed out but contains the single code point U+00F6 LATIN SMALL LETTER O WITH DIAERESIS instead of the two-code-point sequence o + U+0308, so its length is 8. In other words, it is in the composed normalization form (NFC).

Or you might want "Níðhöggr", which has 10 Unicode code points (the í is also split, into i and U+0301 COMBINING ACUTE ACCENT). That would be the decomposed normalization form (NFD).
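
To make the three spellings concrete, here is a small sketch using only `base`. The names `composed`, `original` and `decomposed` are mine, and the strings are written with escapes so that it is clear which code points each one contains.

```haskell
-- Three spellings that render the same; escapes make the code points explicit.
composed, original, decomposed :: String
composed   = "N\xED\xF0h\xF6ggr"       -- NFC: i-acute and o-diaeresis precomposed
original   = "N\xED\xF0ho\x308ggr"     -- the question's string: only o-diaeresis split
decomposed = "Ni\x301\xF0ho\x308ggr"   -- NFD: both accented letters split

-- | Number of code points in each spelling.
lengths :: [Int]
lengths = map length [composed, original, decomposed]   -- [8,9,10]

-- | (==) on String compares code points, not rendered glyphs.
sameText :: Bool
sameText = composed == original   -- False
```

So `length` is simply counting `Char` values, and strings that look identical on screen can differ both in length and under `==`.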

Google "Unicode normalization" for interesting and/or hairy details. Use this function to normalize Unicode data in Haskell (thanks, Adam Rosenfield!).
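
As one possible way to do this (an assumption on my part, not necessarily the function referred to above), the `unicode-transforms` package provides `Data.Text.Normalize.normalize`, which works on `Text`. The sketch below assumes that package and `text` are installed; the names `mixed` and `lengths` are just for illustration.

```haskell
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC, NFD), normalize)

-- | The question's string: precomposed i-acute, but o + U+0308.
mixed :: T.Text
mixed = T.pack "N\xED\xF0ho\x308ggr"

-- | Code point counts: as given, after NFC, after NFD.
lengths :: (Int, Int, Int)
lengths =
  ( T.length mixed
  , T.length (normalize NFC mixed)
  , T.length (normalize NFD mixed)
  )
-- lengths == (9, 8, 10)
```

Normalizing both sides to the same form (NFC is a common choice) before comparing or measuring avoids the surprise from the question.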

Petr Viktorin answered Nov 09 '22 16:11


Because your ö isn't the single character ö (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS); it's U+006F LATIN SMALL LETTER O followed by U+0308 COMBINING DIAERESIS.

Cairnarvon answered Nov 09 '22 17:11