
Collator compares strings weird

Tags:

java

sorting

I have a collection of strings that I need to sort. I'm using a Collator, but the output is not what I expect.

final Collator collator = Collator.getInstance(Locale.US);

List<String> data = new ArrayList<String>();

data.add("1Z5800701_AB");
data.add("1Z5800701_AC");
data.add("1Z5800701-A");
data.add("1Z5800701 A");
data.add("1Z5800701B");
data.add("1Z5800701A");
data.add("1Z5800701 - A");

Collections.sort(data, new Comparator<String>() {

    @Override
    public int compare(String o1, String o2) {
        return collator.compare(o1, o2);
    }
});

for (String s : data) {
    System.out.println(s);
}

And the output is:

1Z5800701_AB
1Z5800701_AC
1Z5800701A
1Z5800701 A
1Z5800701 - A
1Z5800701-A
1Z5800701B

The last string, '1Z5800701B', should come right after '1Z5800701A'. What am I missing here?

Jan Krakora asked Mar 05 '13 17:03



1 Answer

It's a matter of the locale used; you can reproduce the same behavior in the bash shell with LC_ALL=en_US sort. The point is that "word separators" are treated differently from "word characters" in this locale, so you can't always say that a given character X sorts before or after the character B: it depends on context. As a result, 1Z5800701 <optional separators> A sorts before 1Z5800701 <optional separators> B. That is why 1Z5800701B comes after every combination in which the A follows the digits, with or without "separators" in between. You can find more examples of such non-obvious orderings in the Wikipedia articles on the subject.
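To see this for yourself, here is a minimal sketch (not part of the original answer). It assumes the stock java.text.Collator for Locale.US; the class name CollatorDemo and its helper methods are made up for illustration. It first checks whether two strings differ at the collator's primary strength, which is where a separator that is ignored would show up as "equal", and then sorts the question's data by plain String.compareTo order as one possible alternative when locale-aware rules are not wanted.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names here are not from the question or answer.
class CollatorDemo {

    // The same sample strings as in the question.
    static List<String> sampleData() {
        return new ArrayList<>(Arrays.asList(
                "1Z5800701_AB", "1Z5800701_AC", "1Z5800701-A", "1Z5800701 A",
                "1Z5800701B", "1Z5800701A", "1Z5800701 - A"));
    }

    // Returns true if the US collator finds no primary-level difference,
    // i.e. the strings differ only in characters it ignores at that level.
    static boolean samePrimary(String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    // Sorts a copy of the input by String.compareTo (UTF-16 code unit order),
    // which ignores locale rules entirely.
    static List<String> sortByCodeUnit(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    public static void main(String[] args) {
        // Separator variant vs. plain variant of the same "word".
        System.out.println(samePrimary("1Z5800701 A", "1Z5800701A"));
        // Genuinely different letters.
        System.out.println(samePrimary("1Z5800701A", "1Z5800701B"));

        for (String s : sortByCodeUnit(sampleData())) {
            System.out.println(s);
        }
    }
}
```

With code unit ordering, '1Z5800701B' lands directly after '1Z5800701A', because space (0x20) and hyphen (0x2D) sort before the letters and underscore (0x5F) sorts after the uppercase ones. The trade-off is that this ordering is not language-aware, so it is only a reasonable choice for identifiers like these, not for text shown to users in their own language.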

Michał Kosmulski answered Oct 14 '22 08:10
