Case-insensitive string comparison strange behavior

This happens in both C# and Java, so I think it's not a bug; I just wonder why.

var s = "𐐁";
var lower = s.ToLower();
var upper = s.ToUpper();

if (!lower.Equals(upper, StringComparison.OrdinalIgnoreCase))
{
    //How can this happen?
}

According to this page, the lowercase of "𐐁" is "𐐩", so the two should be equal when compared with the IgnoreCase option. Why are they not equal?

asked Oct 12 '18 10:10 by Cheng Chen



1 Answer

In defense of the Java API: the documentation of the method String.equalsIgnoreCase has never claimed that it would work "as expected" on arbitrary Unicode code points. It says:

Two characters c1 and c2 are considered the same ignoring case if at least one of the following is true:

  • The two characters are the same (as compared by the == operator)
  • Applying the method Character.toUpperCase(char) to each character produces the same result
  • Applying the method Character.toLowerCase(char) to each character produces the same result

So, the documentation states quite unambiguously that it applies Character.toUpperCase to chars, that is, to UTF-16 code units, not to Unicode code points.
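To make the code-unit point concrete, here is a minimal Java sketch. The class name CodeUnitCaseDemo and the helper equalsIgnoreCaseByChar are hypothetical: the helper simply applies the three documented rules to each char and is not the JDK's actual implementation. The strings are built from the code points U+10401 and U+10429, which are the two characters from the question.

```java
public class CodeUnitCaseDemo {
    // Hypothetical helper: applies the three documented rules to each
    // UTF-16 code unit (char), the way the quoted documentation describes.
    static boolean equalsIgnoreCaseByChar(String a, String b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i++) {
            char c1 = a.charAt(i);
            char c2 = b.charAt(i);
            if (c1 == c2) continue;
            if (Character.toUpperCase(c1) == Character.toUpperCase(c2)) continue;
            if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) continue;
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10401)); // "𐐁"
        String lower = new String(Character.toChars(0x10429)); // "𐐩"

        // One code point, but two chars: a surrogate pair (D801 DC01).
        System.out.println(upper.length());                           // 2
        System.out.println(upper.codePointCount(0, upper.length()));  // 1

        // A lone surrogate has no case mapping, so the char overload
        // returns it unchanged.
        char low = upper.charAt(1);
        System.out.println(Character.toLowerCase(low) == low);        // true

        // The low surrogates DC01 and DC29 differ, so the char-wise check fails.
        System.out.println(equalsIgnoreCaseByChar(upper, lower));     // false

        // The int overload sees the whole code point and maps it correctly.
        System.out.println(
            Character.toLowerCase(upper.codePointAt(0)) == lower.codePointAt(0)); // true
    }
}
```

Both strings share the high surrogate D801 and differ only in the low surrogate, and no char-level case mapping can bridge that difference.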

If you use the method Character.toUpperCase(int codePoint) on every code point, then the comparison behaves as expected. Here is a short example in Scala (it uses the exact same Java API; the higher-order forall method is hopefully self-explanatory):

val a = "𐐁"
val b = "𐐩"
// Compare code point by code point (Int), not char by char.
println(
  (a.codePoints.toArray zip b.codePoints.toArray).forall {
    case (x, y) =>
      Character.toLowerCase(x) == Character.toLowerCase(y)
  }
)

prints

true

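as expected. Why is this? I think one can safely blame it on backward compatibility: the char-based behavior was specified before Java supported code points outside the Basic Multilingual Plane.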
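For plain Java, the same idea could be sketched roughly as follows. The class CodePointCaseDemo and the method equalsIgnoreCaseByCodePoint are hypothetical names, not part of the JDK. Unlike a bare zip, this version also rejects strings with different numbers of code points. It still uses simple one-to-one case mappings only, so it is an illustration rather than a full Unicode case-folding comparison.

```java
public class CodePointCaseDemo {
    // Hypothetical helper: same three rules as the documented contract,
    // but applied to whole code points (int) instead of chars.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        if (x.length != y.length) return false;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) continue;
            if (Character.toUpperCase(x[i]) == Character.toUpperCase(y[i])) continue;
            if (Character.toLowerCase(x[i]) == Character.toLowerCase(y[i])) continue;
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10401)); // "𐐁"
        String lower = new String(Character.toChars(0x10429)); // "𐐩"
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower)); // true
        System.out.println(equalsIgnoreCaseByCodePoint("abc", "ABC")); // true
        System.out.println(equalsIgnoreCaseByCodePoint("a", "ab"));    // false
    }
}
```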

answered Sep 30 '22 06:09 by Andrey Tyukin