
Curious about the implementation of CaseInsensitiveComparator [duplicate]


While reading the implementation of CaseInsensitiveComparator, which is a private static nested class of String, I found something strange.

private static class CaseInsensitiveComparator
        implements Comparator<String>, java.io.Serializable {
    ...
    public int compare(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) {
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) {
                        // No overflow because of numeric promotion
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }
    ...
}

What I'm curious about is this: in the for loop, once the uppercased characters have been compared, why compare the lowercased characters as well? When Character.toUpperCase(c1) and Character.toUpperCase(c2) are different, is it possible for Character.toLowerCase(c1) and Character.toLowerCase(c2) to be equal?

Couldn't it be simplified like this?

public int compare(String s1, String s2) {
    int n1 = s1.length();
    int n2 = s2.length();
    int min = Math.min(n1, n2);
    for (int i = 0; i < min; i++) {
        char c1 = s1.charAt(i);
        char c2 = s2.charAt(i);
        if (c1 != c2) {
            c1 = Character.toUpperCase(c1);
            c2 = Character.toUpperCase(c2);
            if (c1 != c2) {
                // No overflow because of numeric promotion
                return c1 - c2;
            }
        }
    }
    return n1 - n2;
}

Did I miss something?

ntalbs asked Jul 29 '15 09:07



1 Answer

There are Unicode characters that differ in lowercase but share the same uppercase form. For example, the Greek letter Sigma has two lowercase forms (σ, and ς, which is used only at the end of a word) but only one uppercase form (Σ).
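The Sigma case can be checked directly with Character.toUpperCase and Character.toLowerCase. This is only a minimal sketch: the class name SigmaCaseDemo and its two helper methods are hypothetical names chosen for illustration and are not part of the JDK.

```java
// Hypothetical demo class (not part of the JDK): checks how the two
// lowercase sigmas behave under the per-character case mappings.
class SigmaCaseDemo {
    static final char SMALL_SIGMA = '\u03C3';  // σ
    static final char FINAL_SIGMA = '\u03C2';  // ς (word-final form)

    // True if both characters map to the same uppercase character.
    static boolean sameUpperCase(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    // True if both characters map to the same lowercase character.
    static boolean sameLowerCase(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    public static void main(String[] args) {
        // Both sigmas uppercase to Σ, so the uppercase comparison sees them as equal.
        System.out.println(sameUpperCase(SMALL_SIGMA, FINAL_SIGMA)); // true
        // They are already distinct lowercase letters, so lowercasing keeps them apart.
        System.out.println(sameLowerCase(SMALL_SIGMA, FINAL_SIGMA)); // false
        // The JDK comparator treats the two one-character strings as equal.
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("\u03C3", "\u03C2")); // 0
    }
}
```

Because the uppercase comparison already matches here, a comparator that lowercased only would treat σ and ς as different.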

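I could not find any examples of the reverse when I wrote this, but the Java implementation is prepared for that situation: the extra lowercase comparison catches any pair of characters whose uppercase forms differ but whose lowercase forms match. Your version of the Comparator would still handle the Sigma case correctly, because that case only needs the uppercase comparison.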
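For what it is worth, one pair that seems to fit the reverse situation is the Kelvin sign (U+212A) and the ASCII letter k. The Kelvin sign has no separate uppercase mapping but lowercases to k, so the uppercase forms differ while the lowercase forms match. The sketch below compares the simplified comparator from the question with String.CASE_INSENSITIVE_ORDER. The names UpperOnlyComparatorDemo and UPPER_ONLY are hypothetical, and the result depends on the Unicode data shipped with the JDK in use.

```java
import java.util.Comparator;

// Hypothetical demo class (not part of the JDK).
class UpperOnlyComparatorDemo {
    static final String KELVIN_SIGN = "\u212A"; // looks like a capital K

    // The simplified comparator proposed in the question: uppercase check only.
    static final Comparator<String> UPPER_ONLY = (s1, s2) -> {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) {
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {
                    // No overflow because of numeric promotion
                    return c1 - c2;
                }
            }
        }
        return n1 - n2;
    };

    public static void main(String[] args) {
        // Uppercase forms differ (U+212A vs. K), so the simplified version reports a difference.
        System.out.println(UPPER_ONLY.compare(KELVIN_SIGN, "k") != 0); // true
        // The JDK comparator also lowercases, and both sides become k.
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare(KELVIN_SIGN, "k")); // 0
        // For ordinary ASCII input the two comparators agree.
        System.out.println(UPPER_ONLY.compare("abc", "ABC")); // 0
    }
}
```

If this holds on a given JDK, the two comparators disagree for such input, which would explain why the lowercase step is kept.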

You can find more information in the Case Mapping FAQ on the Unicode website.

Glorfindel answered Oct 03 '22 03:10
