Can anyone explain the following code from String.java
, specifically why there are three if
statements (which I've marked //1
, //2
and //3
)?
private static class CaseInsensitiveComparator
implements Comparator<String>, java.io.Serializable {
// use serialVersionUID from JDK 1.2.2 for interoperability
private static final long serialVersionUID = 8575799808933029326L;
public int compare(String s1, String s2) {
int n1=s1.length(), n2=s2.length();
for (int i1=0, i2=0; i1<n1 && i2<n2; i1++, i2++) {
char c1 = s1.charAt(i1);
char c2 = s2.charAt(i2);
if (c1 != c2) {/////////////////////////1
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if (c1 != c2) {/////////////////////////2
c1 = Character.toLowerCase(c1);
c2 = Character.toLowerCase(c2);
if (c1 != c2) {/////////////////////////3
return c1 - c2;
}
}
}
}
return n1 - n2;
}
}
From Unicode Technical Standard:
In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase
So, it's not enough to compare only uppercase of two characters, because they may have different uppercase and same lowercase
Simple brute force check gives some results. Check for example code points 73 and 304:
char ch1 = (char) 73; //LATIN CAPITAL LETTER I
char ch2 = (char) 304; //LATIN CAPITAL LETTER I WITH DOT ABOVE
System.out.println(ch1==ch2);
System.out.println(Character.toUpperCase(ch1)==Character.toUpperCase(ch2));
System.out.println(Character.toLowerCase(ch1)==Character.toLowerCase(ch2));
Output:
false
false
true
So "İ" and "I" are not equal to each other. Both characters are uppercase. But they share the same lower case letter: "i" and that gives a reason to treat them as same values in case insensitive comparison.
Normally, we would expect to convert the case once and compare and be done with it. However, the code convert the case twice, and the reason is stated in the comment on a different method public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len)
:
Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.
The code of regionMatches
has a few difference from the code in the CaseInsenstiveComparator
, but essentially does the same thing. The full code of the method is quoted below for the purpose of cross-checking:
public boolean regionMatches(boolean ignoreCase, int toffset,
String other, int ooffset, int len) {
char ta[] = value;
int to = offset + toffset;
char pa[] = other.value;
int po = other.offset + ooffset;
// Note: toffset, ooffset, or len might be near -1>>>1.
if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
(ooffset > (long)other.count - len)) {
return false;
}
while (len-- > 0) {
char c1 = ta[to++];
char c2 = pa[po++];
if (c1 == c2) {
continue;
}
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
return false;
}
return true;
}
In another answer, default locale already gave an example for why comparing only uppercase is not enough, namely the ASCII letter "I" and the capital I with dot "İ".
Now you might wonder, why do they not just compare only lowercase instead of both upper and lower case, if it catches more cases than uppercase? The answer is, it does not catch more cases, it merely finds different cases.
Take the letter "ı" ((char)305
, small dotless i) and the ascii "i". They are different, their lowercase is different, but they share the same uppercase letter "I".
And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") matches, but the lowercase of their uppercase is the same ("I"). I found another case if this phenomenon, in the greek letters "ϴ" and "ϑ" (char 1012 and 977).
So a true case insensitive comparison can not even check uppercases and lowercases of the original characters, but must check the lowercases of the uppercases.
Consider the following characters: f
and F
. The initial if
statement would return false
because they don't match. However, if you capitalise both characters, you get F
and F
. Then they would match. The same would not be true of, say, c
and G
.
The code is efficient. There is no need to capitalise both characters if they already match (hence the first if
statement). However, if they don't match, we need to check if they differ only in case (hence the second if
statement).
The final if
statement is used for certain alphabets (such as Georgian) where capitalisation is a complicated affair. To be honest, I don't know a great deal about how that works (just trust that Oracle does!).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With