Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I build a comparator that sorts Strings the same way that PostgreSQL does?

I'm writing an integration test that is passing a complex order by to PostgreSQL and then checking that the data comes back in the correct order. I'm writing this integration test in Java and its String.compareTo method appears to sort things differently from PostgreSQLs. I ran this on my PostgreSQL database:

SELECT regexp_split_to_table('D d a A c b', ' ') ORDER BY 1;

It responded with this:

a
A
b
c
d
D

I then created this unit test to compare that to the way Java sorts things:

import com.google.common.collect.Lists;
import com.google.common.collect.Ordering;
import org.junit.Test;

import java.util.List;

import static junit.framework.Assert.assertEquals;

public class PostgresqlSortOrderTest {

    @Test
    public void whenJavaSortsStringsThenItIsTheSameAsWhenPostgresqlSortsStrings() {
        List<String> postgresqlOrder = Lists.newArrayList("a", "A", "b", "c", "d", "D");
        Ordering<String> ordering = new Ordering<String>() {
            @Override
            public int compare(String left, String right) {

                return left.compareTo(right);
            }
        };
        List<String> javaOrdering = ordering.sortedCopy(postgresqlOrder);
        assertEquals(postgresqlOrder, javaOrdering);
    }

}

This failed with this output:

Expected :[a, A, b, c, d, D]  //postgresql
Actual   :[A, D, a, b, c, d]  //java

I'm very ignorant of the terminology here. I'd like to know the names of these different String sorts so I can communicate better. But more importantly, how can I make Java sort like PostgreSQL does?

like image 276
Daniel Kaplan Avatar asked Jan 06 '14 19:01

Daniel Kaplan


3 Answers

Late to show with an answer, but I'm afraid a simple case insensitive search isn't necessarily going to do what you want.

The keyword you want in your searches is collation (and in a wider sense locales) and PostgreSQL relies on the underlying operating-system to provide support for this. The ordering is rarely a simple character-by-character comparison. For example, in many locales spaces are ignored (that's certainly the case in en_GB).

Also, this means you can end up with different sort orders on different platforms (depending on whether Apple or Microsoft agree with Linus as to the default ordering for your country).

There has been some discussion as to whether it would make sense to include a BSD-licenced library to provide a consistent set of orderings across platforms. However, this is a lot of work and then means you can end up with different sorting inside your database from the rest of your operating-system. While different providers disagree on how to handle this, there's no one simple solution I'm afraid.

You might want to investigate the "C" collation for "traditional" sorting. I'm afraid I can't comment on Java's handling of proper locale sorting - not my field.

like image 99
Richard Huxton Avatar answered Oct 17 '22 17:10

Richard Huxton


Use comparator:String.CASE_INSENSITIVE_ORDER to sort any Collection of Strings. It is already implented in String class.

Look at the Field Summary at: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html

like image 25
Yoda Avatar answered Oct 17 '22 17:10

Yoda


Here is a comparator that matches the PostgreSQL ordering using the en_GB locale:

Comparator<String> comparator = (left, right) -> {
    Collator collator = Collator.getInstance(Locale.UK);
    collator.setStrength(Collator.PRIMARY);
    return collator.compare(left.replaceAll("\\p{Punct}", ""), right.replaceAll("\\p{Punct}", ""));
};
like image 1
David Keen Avatar answered Oct 17 '22 18:10

David Keen