String.StartsWith not working with Asian languages?

I've noticed this strange issue. Check out this Vietnamese (according to Google Translate) string:

string line = "Mìng-dĕ̤ng-ngṳ̄";
string sub = "Mìng-dĕ̤ng-ngṳ";
line.Length            // 15
sub.Length             // 14
line.StartsWith(sub)   // false

Which seems to me like a wrong result. So, I've implemented my custom StartWith function, which compares the string char-by-char.

public bool CustomStartWith(string parent, string child)
{
    for (int i = 0; i < child.Length; i++)
    {
        if (parent[i] != child[i])
            return false;
    }
    return true;
}

And, as I assumed, running this function gives a different result:

CustomStartWith("Mìng-dĕ̤ng-ngṳ̄", "Mìng-dĕ̤ng-ngṳ")   // true

What's going on here?! How's this possible?

No1Lives4Ever asked Jan 12 '16 10:01


2 Answers

The result returned by StartsWith is correct. By default, most string comparison methods perform culture-sensitive comparisons using the current culture, not plain byte sequences. Although your line starts with a byte sequence identical to sub, the substring it represents is not equivalent under most (or all) cultures.

If you really want a comparison that treats strings as plain byte sequences, use the overload:

line.StartsWith(sub, StringComparison.Ordinal);                       // true 

If you want the comparison to be case-insensitive:

line.StartsWith(sub, StringComparison.OrdinalIgnoreCase);             // true 

Here's a more familiar example:

var line1 = "café";   // 63 61 66 E9     : precomposed character 'é' (U+00E9)
var line2 = "café";   // 63 61 66 65 301 : base letter e (U+0065) and
                      //                   combining acute accent (U+0301)
var sub   = "cafe";   // 63 61 66 65

Console.WriteLine(line1.StartsWith(sub));                             // false
Console.WriteLine(line2.StartsWith(sub));                             // false
Console.WriteLine(line1.StartsWith(sub, StringComparison.Ordinal));   // false
Console.WriteLine(line2.StartsWith(sub, StringComparison.Ordinal));   // true

In the above examples, line2 starts with the same byte sequence as sub, followed by a combining acute accent (U+0301) to be applied to the final e. line1 uses the precomposed character for é (U+00E9), so its byte sequence does not match that of sub.

In real-world semantics, one would typically not consider cafe to be a substring of café; the e and é are treated as distinct characters. That é happens to be represented as a pair of characters starting with e is an internal implementation detail of the encoding scheme (Unicode) that should not affect results. This is demonstrated by the above example contrasting café (precomposed) and café (decomposed); one would not expect different results for the two unless specifically intending an ordinal (byte-by-byte) comparison.
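
Because the two spellings of café look identical on screen, it can help to write them with \u escapes and inspect the code units directly. The following is only a sketch: the names NormalizationDemo and CodeUnits are made up for illustration, and it assumes a runtime where string.Normalize is available for non-ASCII input. It dumps the UTF-16 code units of both forms and shows that normalizing to the same form (NFC here) makes them ordinally equal.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Hex dump of the UTF-16 code units of a string, e.g. "63 61 66 E9".
    // (Hypothetical helper, not part of the framework.)
    public static string CodeUnits(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("X")));

    public static void Main()
    {
        string composed = "caf\u00E9";     // precomposed é (U+00E9)
        string decomposed = "cafe\u0301";  // e (U+0065) + combining acute accent (U+0301)

        Console.WriteLine(CodeUnits(composed));    // 63 61 66 E9
        Console.WriteLine(CodeUnits(decomposed));  // 63 61 66 65 301

        // Different code units, so an ordinal comparison sees two different strings.
        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

        // After normalizing the decomposed form to NFC, the code units agree.
        string nfc = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(string.Equals(composed, nfc, StringComparison.Ordinal)); // True
    }
}
```

Normalizing both operands to the same form before an ordinal comparison is one way to get predictable results without depending on the current culture.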

Adapting this explanation to your example:

string line = "Mìng-dĕ̤ng-ngṳ̄";   // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
string sub  = "Mìng-dĕ̤ng-ngṳ";   // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73

Each .NET character represents a UTF-16 code unit, whose values are shown in the comments above. The first 14 code units are identical, which is why your char-by-char comparison evaluates to true (just like StringComparison.Ordinal). However, the 15th code unit in line is the combining macron, ◌̄ (U+0304), which combines with its preceding ṳ (U+1E73) to give ṳ̄. The final character of line is therefore ṳ̄ rather than ṳ, so a culture-sensitive comparison does not treat sub as a prefix of line.
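
One way to make the "combines with its preceding character" point visible is to split both strings into text elements (a base character plus any combining marks) with System.Globalization.StringInfo. This is a rough illustration, not what StartsWith does internally; the class name TextElementDemo and the helper TextElements are invented for the example, and the strings are spelled with \u escapes so the code units match the comments above exactly.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Splits a string into text elements using the framework's enumerator.
    // (Hypothetical helper for illustration.)
    public static List<string> TextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }

    public static void Main()
    {
        string line = "M\u00ECng-d\u0115\u0324ng-ng\u1E73\u0304";
        string sub  = "M\u00ECng-d\u0115\u0324ng-ng\u1E73";

        List<string> lineElems = TextElements(line);
        List<string> subElems = TextElements(sub);

        // Code-unit counts differ, text-element counts do not.
        Console.WriteLine($"{line.Length} code units, {lineElems.Count} text elements"); // 15 code units, 13 text elements
        Console.WriteLine($"{sub.Length} code units, {subElems.Count} text elements");   // 14 code units, 13 text elements

        // The last text elements differ: U+1E73 U+0304 versus U+1E73 alone.
        Console.WriteLine(string.Equals(lineElems[12], subElems[12], StringComparison.Ordinal)); // False
    }
}
```

Seen this way, both strings have the same number of user-visible characters and differ in the last one, which matches the intuition that sub is not a prefix of line.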

Douglas answered Oct 28 '22 05:10


This is not a bug. String.StartsWith is in fact much smarter than a character-by-character check of your two strings. It takes into account your current culture (language settings, etc.) as well as contractions and special characters. (It does not care that two characters are needed to form ṳ̄; it compares them as one.)

So if you don't want to take all those culture-specific settings into account, and just want an ordinal comparison, you have to tell the comparer that.

This is the correct way to do that (not ignoring the case, like Douglas did!):

line.StartsWith(sub, StringComparison.Ordinal); 
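
For comparison, here is a small sketch that puts the ordinal call next to culture-sensitive calls with an explicitly chosen culture, so the result does not depend on the thread's current culture. The class name PrefixDemo is made up; the culture-sensitive results are left uncommented because they depend on the runtime's globalization data (for example, a runtime configured for invariant globalization may behave ordinally).

```csharp
using System;
using System.Globalization;

public static class PrefixDemo
{
    public static void Main()
    {
        string line = "M\u00ECng-d\u0115\u0324ng-ng\u1E73\u0304";
        string sub  = "M\u00ECng-d\u0115\u0324ng-ng\u1E73";

        // Ordinal: compares UTF-16 code units only.
        Console.WriteLine(line.StartsWith(sub, StringComparison.Ordinal)); // True

        // Culture-sensitive, but pinned to the invariant culture instead of the
        // current one. The question reports false for this pair of strings.
        Console.WriteLine(line.StartsWith(sub, StringComparison.InvariantCulture));

        // The same two checks through the lower-level CompareInfo API.
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
        Console.WriteLine(ci.IsPrefix(line, sub, CompareOptions.Ordinal)); // True
        Console.WriteLine(ci.IsPrefix(line, sub, CompareOptions.None));
    }
}
```

Passing the comparison type explicitly at every call site makes the intent clear to readers and avoids surprises when the code runs under a different culture.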

Patrick Hofman answered Oct 28 '22 05:10