Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding logic in CaseInsensitiveComparator

Can anyone explain the following code from String.java, specifically why there are three if statements (which I've marked //1, //2 and //3)?

private static class CaseInsensitiveComparator
                     implements Comparator<String>, java.io.Serializable {
// use serialVersionUID from JDK 1.2.2 for interoperability
private static final long serialVersionUID = 8575799808933029326L;

    public int compare(String s1, String s2) {
        int n1=s1.length(), n2=s2.length();
        for (int i1=0, i2=0; i1<n1 && i2<n2; i1++, i2++) {
            char c1 = s1.charAt(i1);
            char c2 = s2.charAt(i2);
            if (c1 != c2) {/////////////////////////1
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {/////////////////////////2
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) {/////////////////////////3
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }
}
like image 548
Kevin Avatar asked Mar 20 '13 08:03

Kevin


4 Answers

From Unicode Technical Standard:

In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase

So, it's not enough to compare only uppercase of two characters, because they may have different uppercase and same lowercase

Simple brute force check gives some results. Check for example code points 73 and 304:

char ch1 = (char) 73; //LATIN CAPITAL LETTER I
char ch2 = (char) 304; //LATIN CAPITAL LETTER I WITH DOT ABOVE
System.out.println(ch1==ch2);
System.out.println(Character.toUpperCase(ch1)==Character.toUpperCase(ch2));
System.out.println(Character.toLowerCase(ch1)==Character.toLowerCase(ch2));

Output:

false
false
true

So "İ" and "I" are not equal to each other. Both characters are uppercase. But they share the same lower case letter: "i" and that gives a reason to treat them as same values in case insensitive comparison.

like image 139
default locale Avatar answered Nov 11 '22 12:11

default locale


Normally, we would expect to convert the case once and compare and be done with it. However, the code convert the case twice, and the reason is stated in the comment on a different method public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len):

Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.


Appendix

The code of regionMatches has a few difference from the code in the CaseInsenstiveComparator, but essentially does the same thing. The full code of the method is quoted below for the purpose of cross-checking:

public boolean regionMatches(boolean ignoreCase, int toffset,
                       String other, int ooffset, int len) {
    char ta[] = value;
    int to = offset + toffset;
    char pa[] = other.value;
    int po = other.offset + ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
            (ooffset > (long)other.count - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion.  So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}
like image 43
nhahtdh Avatar answered Nov 11 '22 11:11

nhahtdh


In another answer, default locale already gave an example for why comparing only uppercase is not enough, namely the ASCII letter "I" and the capital I with dot "İ".

Now you might wonder, why do they not just compare only lowercase instead of both upper and lower case, if it catches more cases than uppercase? The answer is, it does not catch more cases, it merely finds different cases.

Take the letter "ı" ((char)305, small dotless i) and the ascii "i". They are different, their lowercase is different, but they share the same uppercase letter "I".

And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") matches, but the lowercase of their uppercase is the same ("I"). I found another case if this phenomenon, in the greek letters "ϴ" and "ϑ" (char 1012 and 977).

So a true case insensitive comparison can not even check uppercases and lowercases of the original characters, but must check the lowercases of the uppercases.

like image 22
Christian Semrau Avatar answered Nov 11 '22 11:11

Christian Semrau


Consider the following characters: f and F. The initial if statement would return false because they don't match. However, if you capitalise both characters, you get F and F. Then they would match. The same would not be true of, say, c and G.

The code is efficient. There is no need to capitalise both characters if they already match (hence the first if statement). However, if they don't match, we need to check if they differ only in case (hence the second if statement).

The final if statement is used for certain alphabets (such as Georgian) where capitalisation is a complicated affair. To be honest, I don't know a great deal about how that works (just trust that Oracle does!).

like image 5
Duncan Jones Avatar answered Nov 11 '22 12:11

Duncan Jones