Different behaviour on combining diacritics between String.Split() and String.IndexOf()

Tags: c#, .net, unicode

If I have a string that contains combining diacritics, different string methods seem to behave inconsistently. String.IndexOf() treats the base letter and its combining diacritic as one character and finds the combined character. String.Split(), for some reason, does not combine them and does not find the combined character.

Example code:

string test = "abce\u0308fgh";
Console.WriteLine(test.IndexOf("e"));
Console.WriteLine(test.IndexOf("ë"));

This will work as expected, meaning the e is not found, but the ë is. But for some reason this doesn't behave similarly:

string test = "abcde\u0308fgh";
Console.WriteLine(test.Split('e').Length.ToString());
Console.WriteLine(test.Split('ë').Length.ToString());

For some reason Split() will not combine the diacritic and will split by e, but not by ë.

Is there some reason for this functionality and is there a way to either have an IndexOf() function that doesn't combine the diacritic, or preferably a Split() function that does?

Edit: I noticed that the code I had originally written was different; it passed 'e' (a char) and not "e" (a string):

string test = "abce\u0308fgh";
Console.WriteLine(test.IndexOf('e'));
Console.WriteLine(test.IndexOf('ë'));

This behaves the same way as Split(), so the difference is not between the two methods; it is between the overloads that take a character and those that take a string.

Sami Kuhmonen asked Apr 15 '15 22:04

1 Answer

Actually, when I copy and paste your example code into a blank program, I get exactly the behavior I might expect: both IndexOf() and Split() do not treat the combined character as the passed in ë search character. I.e. the call to IndexOf('ë') returns -1 for me, consistent with how you describe the behavior of Split().
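The difference most likely comes from the comparison rules each overload uses. The char overloads of IndexOf() and Split() perform an ordinal comparison of UTF-16 code units, whereas IndexOf(string) without a StringComparison argument performs a culture-sensitive comparison in the current culture. A minimal sketch of this, with illustrative variable names and the precomposed ë written as \u00EB so the source file's own encoding cannot change the result:

```csharp
using System;

// 'e' followed by U+0308 COMBINING DIAERESIS: two code units that render as one "ë".
string test = "abce\u0308fgh";

// Char overload: ordinal search, so the bare 'e' code unit is found at index 3.
int charE = test.IndexOf('e');

// Char overload: the precomposed U+00EB never occurs in the string, so -1.
int charPrecomposed = test.IndexOf('\u00EB');

// String overload with an explicit ordinal comparison: behaves like the char overload.
int ordinalE = test.IndexOf("e", StringComparison.Ordinal);

// String overload without StringComparison: culture-sensitive, so the result
// can depend on the current culture and the runtime's globalization library.
int cultureE = test.IndexOf("e");

Console.WriteLine($"char 'e': {charE}");
Console.WriteLine($"char U+00EB: {charPrecomposed}");
Console.WriteLine($"ordinal \"e\": {ordinalE}");
Console.WriteLine($"culture-sensitive \"e\": {cultureE}");
```

Passing StringComparison.Ordinal is therefore one way to get an IndexOf() that does not combine the diacritic.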

That said, if you want Split() to treat such a two-character sequence (a base letter followed by a combining mark) as the equivalent single precomposed character, you can call string.Normalize() before Split(). For example:

Console.WriteLine(test.Normalize().Split('ë').Length);

The Normalize() method has an overload to let you control the exact type of normalization, should that be required (it's not in the example you've provided).
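As a rough sketch of the normalization approach, assuming a runtime with normal globalization support: the parameterless Normalize() uses Form C (composition), and the overload takes a System.Text.NormalizationForm value. The variable names are illustrative only.

```csharp
using System;
using System.Text;

// 'e' + U+0308 COMBINING DIAERESIS, as in the question's Split() example.
string test = "abcde\u0308fgh";

// Without normalization, the ordinal Split() finds no U+00EB and returns one piece.
string[] rawParts = test.Split('\u00EB');

// Form C composes 'e' + U+0308 into the single code unit U+00EB.
string composed = test.Normalize(NormalizationForm.FormC);
string[] composedParts = composed.Split('\u00EB');

// Form D goes the other way: a precomposed U+00EB becomes 'e' + U+0308.
string decomposed = "abcd\u00EBfgh".Normalize(NormalizationForm.FormD);

Console.WriteLine(rawParts.Length);                        // 1
Console.WriteLine(composedParts.Length);                   // 2
Console.WriteLine(string.Join("|", composedParts));        // abcd|fgh
Console.WriteLine(string.Equals(decomposed, test, StringComparison.Ordinal)); // True
```

Normalizing both the input and the separator to the same form before splitting avoids depending on how either one happened to be encoded.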

Peter Duniho answered Nov 12 '22 16:11