This happens in both C# and Java, so I think it's not a bug; I just wonder why.
var s = "𐐁";
var lower = s.ToLower();
var upper = s.ToUpper();
if (!lower.Equals(upper, StringComparison.OrdinalIgnoreCase))
{
    // How can this happen?
}
According to this page, the lower case of "𐐁" is "𐐩", so they should be equal when compared with the IgnoreCase
option. Why are they not equal?
Comparing strings in a case-insensitive manner means comparing them without regard to uppercase and lowercase letters. A common way to do this is to convert both strings with toUpperCase() or toLowerCase() before comparing them.
operators differs from string comparison using the String.CompareTo and Compare(String, String) methods. They all perform a case-sensitive comparison.
The StringComparer returned by the CurrentCultureIgnoreCase property can be used when strings are linguistically relevant but their case is not. For example, if strings are displayed to the user but case is unimportant, culture-sensitive, case-insensitive string comparison should be used to order the string data.
In defense of the Java API: the documentation of the method String.equalsIgnoreCase has never claimed that it would work "as expected" on arbitrary Unicode code points. It says:
Two characters c1 and c2 are considered the same ignoring case if at least one of the following is true:
- The two characters are the same (as compared by the == operator)
- Applying the method Character.toUpperCase(char) to each character produces the same result
- Applying the method Character.toLowerCase(char) to each character produces the same result
So, the documentation states quite unambiguously that it applies Character.toUpperCase to chars, that is, to UTF-16 code units, not to Unicode code points.
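To make the distinction between code units and code points concrete, here is a minimal Java sketch. The class name SurrogateDemo is made up for illustration, and the string is written with \u escapes on the assumption that "𐐁" is U+10401 (DESERET CAPITAL LETTER LONG E). It only uses the two Character.toLowerCase overloads and shows that the char overload cannot map half of a surrogate pair:

```java
public class SurrogateDemo {
    // Assumed to be the character from the question: U+10401,
    // written with escapes so the source file encoding does not matter.
    public static final String UPPER = "\uD801\uDC01";

    public static void main(String[] args) {
        // One code point, but two UTF-16 code units (a surrogate pair).
        System.out.println(UPPER.length());                          // 2
        System.out.println(UPPER.codePointCount(0, UPPER.length())); // 1

        // The char overload sees a lone surrogate and returns it unchanged.
        char low = UPPER.charAt(1);
        System.out.println(Character.toLowerCase(low) == low);       // true

        // The int overload sees the whole code point and can map it.
        int cp = UPPER.codePointAt(0);
        System.out.printf("U+%X -> U+%X%n", cp, Character.toLowerCase(cp)); // U+10401 -> U+10429
    }
}
```

Since both halves of the pair come back unchanged from the char overloads, a char-by-char comparison has no way to see that the two strings differ only in case.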
If you use the method Character.toUpperCase(int codePoint) on every code point, then the comparison behaves as expected. Here is a short example in Scala (uses the exact same Java API, the higher-order forall method is hopefully self-explanatory):
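val a = "𐐁"
val b = "𐐩"
// Compare code point by code point, using the int overload of Character.toLowerCase.
// Note: zip stops at the shorter array, so this assumes equal code point counts.
val sameIgnoringCase = (a.codePoints.toArray zip b.codePoints.toArray).forall {
  case (x, y) =>
    Character.toLowerCase(x) == Character.toLowerCase(y)
}
println(sameIgnoringCase)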
prints
true
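as expected. Why is this so? I think one can safely blame it on backward compatibility: char is a 16-bit UTF-16 code unit, and String.equalsIgnoreCase was specified in terms of char long before Java gained its code point APIs.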
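For completeness, the same idea can be sketched in plain Java. The helper name equalsIgnoreCaseByCodePoint and the class CodePointCompare are made up for illustration and are not part of the JDK. The sketch applies the three rules quoted above to code points instead of chars, and it only uses simple one-to-one case mappings, so it is not a full Unicode case-folding comparison:

```java
public class CodePointCompare {
    /**
     * Hypothetical helper (not a JDK method): compares two strings
     * code point by code point, ignoring case.
     */
    public static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int x = a.codePointAt(i);
            int y = b.codePointAt(j);
            // Same three checks as in the quoted documentation,
            // but using the int (code point) overloads.
            if (x != y
                    && Character.toUpperCase(x) != Character.toUpperCase(y)
                    && Character.toLowerCase(x) != Character.toLowerCase(y)) {
                return false;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(x);
            j += Character.charCount(y);
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        String upper = "\uD801\uDC01"; // U+10401, assumed to be the "𐐁" from the question
        String lower = "\uD801\uDC29"; // U+10429, assumed to be the "𐐩" from the question
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower)); // true
        System.out.println(equalsIgnoreCaseByCodePoint("abc", "ABD")); // false
    }
}
```

Unlike the zip-based version, this one also rejects strings with different numbers of code points, because it checks that both inputs were fully consumed.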