
Sorting multi locale strings in Java

I'm trying to sort a List of objects by the String field "country". Each country name is in its native language:

  • Argentina
  • Australia
  • Österreich
  • Ελλάδα
  • България ...

What I want is for "България", for instance, to appear after the "A*" countries, since the letter 'Б' corresponds to the Latin 'B'. I'm trying to use the default Collator, but non-Latin names still end up last in the list.

Here's my code so far:

private static final Comparator<DomainTO> DOMAIN_COUNTRY_COMPARATOR =
    new Comparator<DomainTO>() {
    @Override
    public int compare(DomainTO t, DomainTO t1) {
        Collator defaultCollator = Collator.getInstance();
        return defaultCollator.compare(t.getCountry(), t1.getCountry());
    }
};
mkvcvc asked Nov 17 '10 10:11



2 Answers

How do you sort words from different languages? There are many alphabets (English, Russian, German, etc.), and each has its own ordered list of letters. Sorting words that come from a single alphabet is easy. But is it possible to merge all these alphabets into one?
I don't think it can be done in a way that everyone would accept. Take the English and Russian alphabets as an example. Russian letters can be mapped to English letters (most of them, at least), but after this mapping they no longer keep their original order. That favors one alphabet over the other. Why not map English letters to Russian instead?
Another issue is letters with diacritics. In German, Ö is placed between O and P, and in Polish Ó occupies the same place. So we have the following relations:

O < Ö < P  
O < Ó < P

But what is the relation between Ö and Ó? If there were a country called Ósterreich, should it come before or after Österreich? So it is impossible to define universal rules for sorting words from different languages.
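
The Ö versus Ó question can be tried out in Java. This is a minimal sketch, assuming the JDK's root-locale `Collator`; the class name `AccentStrengthDemo` and the made-up country Ósterreich are only for illustration. At `PRIMARY` strength the collator ignores accents, so both letters count as a plain O. At a finer strength they are told apart, but which one comes first is a convention of the collation rules, not something either alphabet defines.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentStrengthDemo {

    /** Compares two strings with the root-locale collator at the given strength. */
    public static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY ignores accents, so the two names compare as equal (prints 0).
        System.out.println(compare("Österreich", "Ósterreich", Collator.PRIMARY));
        // TERTIARY tells them apart; the sign depends on the collator's accent ordering.
        System.out.println(compare("Österreich", "Ósterreich", Collator.TERTIARY));
    }
}
```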

All we can do is map every alphabet onto one chosen alphabet, and this is what the OP is trying to do.
The chosen alphabet is Latin, and the other alphabets have to be mapped onto it. The problem is that this mapping is often ambiguous. Only most Russian or Greek letters can be mapped easily.
Arabic and Asian scripts are a much bigger problem. And we should remember that mapping one alphabet onto another often loses information.

So how can we do such sorting?

  1. The first option is to manually provide a Latin name for every country. We would then have a list of pairs such as
    • Россия Rossija
    • Ελλάδα Ellada
      Then we could sort by the Latin name and display the native name.
  2. The second approach is to run code similar to this:

Code:

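// Note: the table is incomplete (Ъ, Ы, Ь, Э, Ю, Я are missing) and shows only one possible mapping.
char[] russian = "АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщ".toCharArray();
// The last pair is "Ss" (not "ss") so that an uppercase Щ stays uppercase.
char[] russianTo = "AaBbWwGgDdEeEeZzZzIiJjKkLlMmNnOoPpRrSsTtUuFfHhCcCcSsSs".toCharArray();
String input = "Волгоград"; // example input; in the comparator this would be the country name
for (int i = 0; i < russian.length; i++) {
    // String.replace(char, char) replaces every occurrence of the character
    input = input.replace(russian[i], russianTo[i]);
}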

This converts letters of the Russian alphabet. Similar code would have to be added for every other alphabet, and Russian is one of the simplest cases.
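
For comparison, the first option (a hand-maintained Latin name per country) could look roughly like this. It is only a sketch: the `Country` record, the class name `LatinKeySortDemo` and the use of the root-locale `Collator` are assumptions made for illustration, and records need Java 16 or later.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class LatinKeySortDemo {

    /** A country name in its native script plus a hand-maintained Latin sort key. */
    public record Country(String nativeName, String latinKey) {}

    /** Returns a new list ordered by the Latin key; the native name is what gets displayed. */
    public static List<Country> sortByLatinKey(List<Country> countries) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        List<Country> sorted = new ArrayList<>(countries);
        sorted.sort(Comparator.comparing(Country::latinKey, collator));
        return sorted;
    }

    public static void main(String[] args) {
        List<Country> countries = List.of(
            new Country("Россия", "Rossija"),
            new Country("Ελλάδα", "Ellada"),
            new Country("Argentina", "Argentina"),
            new Country("Österreich", "Osterreich"));
        // Prints Argentina, Ελλάδα, Österreich, Россия (one per line).
        for (Country c : sortByLatinKey(countries)) {
            System.out.println(c.nativeName());
        }
    }
}
```

The sort key never reaches the user, so whoever maintains the list is free to choose whichever transliteration convention they prefer.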
But assume that we succeeded and managed to sort words from all the languages of the world this way.
What are the consequences of such sorting? Before answering, let's ask what the intention behind it was. The OP didn't give his reasons for sorting this way, but we can deduce them:

  • Why do we sort elements? To make them easier to find.
  • Why are the country names in native languages? To make the list useful for people who know only their native language.

So let's answer the question: does this sorting make it easier for someone who knows only his native language to find a specific country?

  1. Someone from Austria expects Österreich to come after all countries starting with O. But after normalization Österreich becomes Osterreich and ends up somewhere between Ontario and Ottawa. (I know that Ontario and Ottawa are not countries; this is only an example.)
  2. For someone from Japan who doesn't know the Latin alphabet, this sorting would be useless. He would have to scan the whole list until he finds his country.
  3. Let's assume there is a country Волгоград (Wolgograd) and a citizen of this country who knows only the Russian alphabet. In the Russian alphabet В is the third letter, so this man would search near the beginning of the list (somewhere between Belgia and Danmark), while Волгоград would be near the end of the list (close to Venezuela). So in this case the sorting is not only unhelpful but also misleading.
  4. Even for someone who knows the Latin alphabet, finding his country may not be easy. When all countries are named in English and I am looking for 'Poland', I always know whether to go up or down the list. If I see 'Japan', I know to go down. If I see 'Russia', I know to go up.
    But if all these native names were sorted together, there could be a problem. If I saw ايران, I would not be able to decide whether to go up or down the list, so here the sorting does not help. A worse scenario is when I encounter Волгоград in the list. I don't know the Russian alphabet, so I would assume that I am near the letter 'B' when in fact I am close to the end of the list. Then I would choose the wrong direction.
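
The Волгоград case can be reproduced with a small sketch. It assumes the letter pairs from the transliteration table earlier in this answer (В → W and so on) and the JDK's root-locale `Collator`; the class and method names are made up for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TransliterationOrderDemo {

    // Letter pairs taken from the answer's table, limited to what "Волгоград" needs.
    private static final String CYRILLIC = "ВвГгДдЛлОоРрАа";
    private static final String LATIN    = "WwGgDdLlOoRrAa";

    /** Replaces each known Cyrillic letter with its Latin counterpart; other chars pass through. */
    public static String transliterate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            int i = CYRILLIC.indexOf(c);
            out.append(i >= 0 ? LATIN.charAt(i) : c);
        }
        return out.toString();
    }

    /** Sorts names by their transliterated form using the root-locale collator. */
    public static List<String> sortByTransliteration(List<String> names) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort((a, b) -> collator.compare(transliterate(a), transliterate(b)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Venezuela", "Волгоград", "Belgia", "Danmark");
        // "Волгоград" is keyed as "Wolgograd", so it is printed last, after Venezuela.
        System.out.println(sortByTransliteration(names));
    }
}
```

A reader who knows only Cyrillic would look for В near the top of this list, while a reader who knows only Latin would likely take В for a 'B'. Both would end up searching in the wrong place.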

Summary:

Sorting country names written in different languages is difficult to define and implement. And once implemented, it would be either unhelpful or harmful.

rtruszk answered Oct 22 '22 01:10



Perhaps you can compare the normalized Strings. Something like this:

private static final Comparator<DomainTO> DOMAIN_COUNTRY_COMPARATOR =
    new Comparator<DomainTO>() {

        private String normalize(final String input) {
            return Normalizer
                .normalize(input, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        }

        @Override
        public int compare(final DomainTO t, final DomainTO t1) {
            return normalize(t.getCountry()).compareTo(
                normalize(t1.getCountry()));
        }
    };

See related question about normalizing: Converting Java String to ascii (this question is linked to several similar questions)
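
One caveat is worth checking against the sample data from the question. The sketch below reuses the same `normalize` logic in a standalone class (the name `NormalizeDemo` is only for illustration). NFD decomposition splits Ö into O plus a combining diaeresis, so stripping non-ASCII characters leaves "Osterreich". Greek and Cyrillic letters have no ASCII base letter, so those names are reduced to empty strings. They then all compare as equal and sort ahead of every Latin name.

```java
import java.text.Normalizer;

public class NormalizeDemo {

    /** Same normalization as in the comparator above: decompose, then drop non-ASCII. */
    public static String normalize(String input) {
        return Normalizer
            .normalize(input, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("Österreich") + "]"); // [Osterreich]
        System.out.println("[" + normalize("Ελλάδα") + "]");     // []
        System.out.println("[" + normalize("България") + "]");   // []
    }
}
```

So this approach handles accented Latin names, but non-Latin scripts would still need a transliteration step before the comparison.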

Sean Patrick Floyd answered Oct 22 '22 02:10
