
IndexOf and ordinal string comparisons

My problem is that String.IndexOf returns -1. I would expect it to return 0.

The parameters:

text = C:\\Users\\User\\Desktop\\Sync\\̼ (note the Combining Seagull Below character)

stringToTrim = C:\\Users\\User\\Desktop\\Sync\\

When I look up the index with int index = text.IndexOf(stringToTrim);, the value of index is -1. Switching to an ordinal string comparison fixes this:

int index = text.IndexOf(stringToTrim, StringComparison.Ordinal);

From what I have read online, many Unicode characters (such as U+00B5 and U+03BC) render as the same symbol, so it seems like a good idea to go a step further and normalize both strings:

int index = text.Normalize(NormalizationForm.FormKD).IndexOf(stringToTrim.Normalize(NormalizationForm.FormKD), StringComparison.Ordinal);

Is this the correct way to find the index at which one string contains all the sequential characters of another? As I understand it, you normalize when you want visually identical symbols to match, and you skip normalization when you want to compare characters by their encoded values (so look-alike symbols stay distinct). Is that right?

Also, could someone please explain why int index = text.IndexOf(stringToTrim); did not find a match at the start of the string? In other words, what is it actually doing under the covers? I would have expected it to search from the beginning of the string to the end.

Alexandru asked Dec 15 '14 20:12


2 Answers

The behavior makes perfect sense to me. You are using a combining character, which is combined with the preceding character, turning it into a different character, one which won't match the '\\' character you've specified at the end of your search string. That prevents the entire string you're looking for from being found. If you looked for "C:\\Users\\User\\Desktop\\Sync" instead, it would have found it.
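To make the "combined with the preceding character" point concrete, here is a minimal sketch using System.Globalization.StringInfo. The class and method names (TextElementDemo, CountTextElements) are only for illustration. It shows that a backslash followed by U+033C is two UTF-16 code units but a single text element:

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Counts text elements (user-perceived characters) instead of UTF-16 code units.
    // Hypothetical helper name, used only for this illustration.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        string plain = "\\";          // a lone backslash
        string combined = "\\\u033C"; // backslash + U+033C COMBINING SEAGULL BELOW

        Console.WriteLine(plain.Length);                // 1 code unit
        Console.WriteLine(combined.Length);             // 2 code units
        Console.WriteLine(CountTextElements(combined)); // 1 text element
    }
}
```

A linguistic (culture-sensitive) search works with units like that combined element, which is why the trailing '\\' in the search string has nothing to match.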

Using StringComparison.Ordinal tells .NET to ignore the culture-specific and linguistic rules for comparing characters and look only at their exact ordinal (UTF-16 code unit) values. This seems to do what you wanted, so yes…that's what you should do.
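A short sketch of the two calls side by side, using the strings from the question (OrdinalSearchDemo and OrdinalIndex are illustrative names). The ordinal result is deterministic. The culture-sensitive result is printed without an expected value because it depends on the runtime and the culture data it uses:

```csharp
using System;

static class OrdinalSearchDemo
{
    // Compares raw UTF-16 code units; no culture or linguistic rules are applied.
    public static int OrdinalIndex(string text, string value)
    {
        return text.IndexOf(value, StringComparison.Ordinal);
    }

    static void Main()
    {
        // Same values as in the question; \u033C is COMBINING SEAGULL BELOW.
        string text = "C:\\Users\\User\\Desktop\\Sync\\\u033C";
        string stringToTrim = "C:\\Users\\User\\Desktop\\Sync\\";

        // Ordinal search: the code units of stringToTrim are a prefix of text.
        Console.WriteLine(OrdinalIndex(text, stringToTrim)); // 0

        // Culture-sensitive search: the question reports -1 here,
        // but the outcome can differ between runtimes and cultures.
        Console.WriteLine(text.IndexOf(stringToTrim, StringComparison.CurrentCulture));
    }
}
```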

The "correct approach" depends entirely on what behavior you want. A lot of string manipulation involves text being presented to or provided by the user and should be done in a culture-aware and Unicode-aware way. Other times, that isn't desirable. It's important to select the right approach for your needs.
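The normalization idea from the question is one example of that choice. As a rough sketch (NormalizationDemo and EqualAfter are made-up names), the two characters mentioned there, U+00B5 and U+03BC, differ ordinally and under canonical normalization, and only become equal under a compatibility form such as FormKD:

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Normalizes both strings to the given form, then compares them ordinally.
    public static bool EqualAfter(string a, string b, NormalizationForm form)
    {
        return string.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal);
    }

    static void Main()
    {
        string microSign = "\u00B5"; // MICRO SIGN
        string greekMu = "\u03BC";   // GREEK SMALL LETTER MU

        // Different code points, so a plain ordinal comparison says "not equal".
        Console.WriteLine(string.Equals(microSign, greekMu, StringComparison.Ordinal)); // False

        // Canonical normalization (FormC) leaves the micro sign unchanged.
        Console.WriteLine(EqualAfter(microSign, greekMu, NormalizationForm.FormC)); // False

        // Compatibility normalization (FormKD) maps the micro sign to Greek mu.
        Console.WriteLine(EqualAfter(microSign, greekMu, NormalizationForm.FormKD)); // True
    }
}
```

Whether folding such look-alikes together is desirable depends on the data. For file paths it usually is not, since the file system treats them as different names.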

Peter Duniho answered Sep 30 '22 19:09


Yes, you should use StringComparison.Ordinal to guarantee that the culture is ignored when comparing the values. This matters especially for strings that are considered culture-invariant "by default", which includes file paths.

Without StringComparison.Ordinal, it is possible to introduce subtle bugs: http://msdn.microsoft.com/en-us/library/dd465121(v=vs.110).aspx

When culturally independent string data, such as XML tags, HTML tags, user names, file paths, and the names of system objects, are interpreted as if they were culture-sensitive, application code can be subject to subtle bugs, poor performance, and, in some cases, security issues.

A side benefit of StringComparison.Ordinal is better performance: http://msdn.microsoft.com/en-us/library/ms973919.aspx

PiotrWolkowski answered Sep 30 '22 21:09