
Strange length: a string containing an accent such as "é" returns 2

Tags:

javascript

I have a strange problem that I can't explain. I'm trying to manipulate a string containing an accented character such as "é". The string is the name of an image file selected through an input of type file.

What I can't understand is why, when I parse the string, the accented character is split into two characters. Here is an example:

My é is divided into two characters: e and ́ (a separate accent mark).

"é".length
=> 2

Could UTF-8 be involved?

I really don't understand this at all!

hypee asked Sep 02 '13 17:09




2 Answers

They are called Combining Diacritical Marks. They are part of Unicode: diacritics that can be combined, and even chained, on any base character. That is why the length of the string in this case is 2 (there is the e and the ́ mark). The precomposed characters like àéèìòù have been kept for compatibility, but now any character can be accented :-) I'd guess most programmers don't know this, and most programs support it very badly. I'm quite sure they could be used as an attack vector somewhere (but I'm not paranoid :-) )
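To make the "chaining" concrete, here is a minimal sketch, assuming an engine that supports Unicode property escapes in regular expressions (ES2018 or later). The helper name `countMarks` and the sample strings are only illustrative, not something from the question.

```javascript
// U+0301 is COMBINING ACUTE ACCENT, U+0323 is COMBINING DOT BELOW.
const accentedE = "e\u0301";        // base letter "e" followed by one combining mark
const stackedQ = "q\u0301\u0323";   // two marks chained on a single base letter

// Hypothetical helper: \p{M} matches any code point in the Unicode
// "Mark" category; the `u` flag is required for property escapes.
const countMarks = (s) => (s.match(/\p{M}/gu) || []).length;

console.log(accentedE.length, countMarks(accentedE)); // 2 1
console.log(stackedQ.length, countMarks(stackedQ));   // 3 2
```

Each combining mark is its own code unit here, so `length` grows by one for every mark stacked on the base letter.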

I'll add that even Jon Skeet in 2009 wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/

You see, I couldn't remember whether combining characters came before or after base characters

:-) :-)

xanatos answered Oct 02 '22 10:10


It's more likely that combining diacritical marks are involved, rather than UTF-8.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2

JavaScript strings are sequences of UTF-16 code units, so a string can hold the precomposed "é" (U+00E9) in a single code unit, in which case its length is 1.
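If the goal is to get the single-code-unit form back, one option (not mentioned in the question, so treat it as a suggestion) is `String.prototype.normalize`, available since ES2015. A small sketch, assuming the file name arrives in decomposed form as in the example above:

```javascript
// "e" + U+0301 COMBINING ACUTE ACCENT, as in the console example above.
const decomposedE = "e\u0301";

// NFC composes base letter + mark into the precomposed character where one exists.
const composedE = decomposedE.normalize("NFC");

console.log(decomposedE.length);               // 2
console.log(composedE.length);                 // 1
console.log(composedE === "\u00e9");           // true
console.log(decomposedE === composedE);        // false: they look alike but differ as strings
console.log(composedE.normalize("NFD").length); // 2: NFD decomposes it again
```

Normalizing both sides to the same form before comparing or measuring strings avoids the surprise in the question.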


But characters outside the BMP (those with code points beyond U+FFFF) have a length of 2, because each is encoded as two UTF-16 code units (a surrogate pair).

>>> "😐".length
2
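A short sketch of how to look at code points instead of code units, assuming an ES2015 engine. The variable names are only illustrative:

```javascript
// U+1F610 NEUTRAL FACE, the emoji from the example above.
const neutralFace = "\u{1F610}";

// length counts UTF-16 code units: a surrogate pair counts as 2.
console.log(neutralFace.length);                        // 2
console.log(neutralFace.charCodeAt(0).toString(16));    // "d83d" (high surrogate)
console.log(neutralFace.charCodeAt(1).toString(16));    // "de10" (low surrogate)

// The string iterator walks code points, so spreading gives one element.
console.log([...neutralFace].length);                   // 1
console.log(neutralFace.codePointAt(0).toString(16));   // "1f610"

// Counting code points does not merge combining marks:
// "e" + U+0301 is still two code points.
console.log([..."e\u0301"].length);                     // 2
```

So the two cases are separate: surrogate pairs go away when you count code points, while combining marks only go away after normalization (and only when a precomposed character exists).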
kennytm answered Oct 02 '22 11:10