I've noticed this strange issue. Check out this Vietnamese (according to Google Translate) string:
    string line = "Mìng-dĕ̤ng-ngṳ̄";
    string sub = "Mìng-dĕ̤ng-ngṳ";

    line.Length            // 15
    sub.Length             // 14
    line.StartsWith(sub)   // false
That seems like the wrong result to me, so I implemented my own StartWith function, which compares the strings char by char.
    public bool CustomStartWith(string parent, string child)
    {
        for (int i = 0; i < child.Length; i++)
        {
            if (parent[i] != child[i])
                return false;
        }
        return true;
    }
And, as I expected, this function returns true:
    CustomStartWith("Mìng-dĕ̤ng-ngṳ̄", "Mìng-dĕ̤ng-ngṳ")   // true
What's going on here?! How's this possible?
The result returned by StartsWith is correct. By default, most string comparison methods perform culture-sensitive comparisons using the current culture, not plain byte sequences. Although your line starts with a byte sequence identical to sub, the substring it represents is not equivalent under most (or all) cultures.
If you really want a comparison that treats strings as plain byte sequences, use the overload:
line.StartsWith(sub, StringComparison.Ordinal); // true
If you want the comparison to be case-insensitive:
line.StartsWith(sub, StringComparison.OrdinalIgnoreCase); // true
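To see how the choice of comparison changes the answer, here is a minimal sketch that runs StartsWith once for every StringComparison value. The class and method names (StartsWithModes, Describe) are hypothetical and only for illustration. The culture-sensitive rows depend on the runtime's globalization settings, so only the ordinal rows are guaranteed.

```csharp
using System;
using System.Collections.Generic;

public static class StartsWithModes
{
    // Hypothetical helper: yields one "mode: result" row per StringComparison value.
    public static IEnumerable<string> Describe(string text, string prefix)
    {
        foreach (StringComparison mode in Enum.GetValues(typeof(StringComparison)))
        {
            yield return $"{mode}: {text.StartsWith(prefix, mode)}";
        }
    }
}
```

Calling `StartsWithModes.Describe(line, sub)` with the two strings from the question and printing each row gives six rows; the Ordinal and OrdinalIgnoreCase rows report True regardless of culture.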
Here's a more familiar example:
    var line1 = "caf\u00E9";    // 63 61 66 E9: precomposed character 'é' (U+00E9)
    var line2 = "cafe\u0301";   // 63 61 66 65 301: base letter e (U+0065) and
                                // combining acute accent (U+0301)
    var sub = "cafe";           // 63 61 66 65

    Console.WriteLine(line1.StartsWith(sub));                             // false
    Console.WriteLine(line2.StartsWith(sub));                             // false
    Console.WriteLine(line1.StartsWith(sub, StringComparison.Ordinal));   // false
    Console.WriteLine(line2.StartsWith(sub, StringComparison.Ordinal));   // true
In the above examples, line2 starts with the same byte sequence as sub, followed by a combining acute accent (U+0301) to be applied to the final e. line1 uses the precomposed character for é (U+00E9), so its byte sequence does not match that of sub.
In real-world semantics, one would typically not consider cafe to be a substring of café; the e and é are treated as distinct characters. That é happens to be represented as a pair of characters starting with e is an internal implementation detail of the encoding scheme (Unicode) that should not affect results. This is demonstrated by the above example contrasting the precomposed café (line1) with the decomposed café (line2); one would not expect different results for the two unless specifically intending an ordinal (byte-by-byte) comparison.
Adapting this explanation to your example:
    string line = "Mìng-dĕ̤ng-ngṳ̄";
    // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
    string sub = "Mìng-dĕ̤ng-ngṳ";
    // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73
Each .NET character represents a UTF-16 code unit, whose values are shown in the comments above. The first 14 code units are identical, which is why your char-by-char comparison evaluates to true (just like StringComparison.Ordinal). However, the 15th code unit in line is the combining macron, ◌̄ (U+0304), which combines with its preceding ṳ (U+1E73) to give ṳ̄.
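If you want to check the code unit values yourself, a small helper like the following prints them. This is just a sketch; the names CodeUnitDump and ToHex are made up for this example, and the string is written with \u escapes so the combining marks are visible in the source.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Returns the UTF-16 code units of s as space-separated hex values (no zero padding).
    public static string ToHex(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("X")));
}
```

`CodeUnitDump.ToHex("M\u00ECng-d\u0115\u0324ng-ng\u1E73\u0304")` returns `4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304`, where the trailing 304 is the combining macron that sub lacks.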
This is not a bug. String.StartsWith is in fact much smarter than a character-by-character check of your two strings. It takes into account your current culture (language settings, etc.) as well as contractions and special characters. (It does not care that you need two characters to end up with ṳ̄; it compares it as one.)
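To get a feel for "comparing it as one", you can split the strings into text elements (a base character plus any combining marks that follow it) with StringInfo. The sketch below is only an illustration of that idea, not how String.StartsWith is implemented internally; TextElements, Split and StartsWithByElement are hypothetical names, and the helper compares elements ordinally, so it does not treat precomposed and decomposed forms as equal.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Splits s into text elements: a base character plus its combining marks.
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            result.Add(e.GetTextElement());
        }
        return result;
    }

    // Hypothetical helper: true if each text element of prefix equals
    // the text element at the same position in text (ordinal comparison).
    public static bool StartsWithByElement(string text, string prefix)
    {
        List<string> t = Split(text);
        List<string> p = Split(prefix);
        if (p.Count > t.Count) return false;
        for (int i = 0; i < p.Count; i++)
        {
            if (!string.Equals(t[i], p[i], StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

For the strings in the question, both split into 13 text elements. The last element of line is ṳ̄ (U+1E73 U+0304) while the last element of sub is ṳ (U+1E73), so StartsWithByElement(line, sub) returns false, matching the culture-sensitive result.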
So if you don't want all those culture-specific rules applied and just want an ordinal comparison, you have to tell the comparer so.
This is the correct way to do that (not ignoring the case, like Douglas did!):
line.StartsWith(sub, StringComparison.Ordinal);