
ä letter sorting problem in Java

Hi, I have this piece of code:

Collator col = Collator.getInstance(Locale.GERMAN);

List<String> list = new ArrayList<String>();
list.add("ac");
list.add("äb");
list.add("aa");
list.add("bb");


Collections.sort(list,col);
System.out.println(list);

I would expect the output [aa, ac, äb, bb], but instead I am getting [aa, äb, ac, bb].

I have no idea what I am doing wrong. Thanks in advance for any help.


Hi, thanks all for the answers.

Unfortunately, the project requirements clearly state that the strings must be sorted in this order: [aa, ac, äb, bb]. So I tried this code:

 String europeanRules =
        ("< a,A ; \u00e0,\u00c0 ; \u00e1,\u00c1 ; \u00e2,\u00c2 ; \u00e3,\u00c3; \u00e4,\u00c4 ; \u00e5,\u00c5 ; \u00e6,\u00c6 "+
                "; \u0101,\u0100 ; \u0103,\u0102 ; \u0105,\u0104 " +       
         "< b,B < c,C ; \u00e7,\u00c7 ; \u0107,\u0106 ; \u0109,\u0108 ; \u010b,\u010a ; \u010d,\u010c " +
         "< d,D ; \u010f,\u010e ; \u0111,\u0110 " +
         "< e,E ; \u00e8,\u00c8 ; \u00e9,\u00c9 ; \u00ea,\u00ca ; \u00eb,\u00cb " +
             "; \u0113,\u0112 ; \u0115,\u0114 ; \u0117,\u0116 ; \u0119,\u0118 ; \u011b,\u011a " +
         "< f,F < g,G < h,H " +
         "< i,I ; \u00ec,\u00cc ; \u00ed,\u00cd ; \u00ee,\u00ce ; \u00ef,\u00cf " +
         "< j,J < k,K " +
         "< l,L ; \u013a,\u0139 ; \u013c,\u013b ; \u013e,\u013d ; \u0140,\u013f ; \u0142,\u0141 " +
         "< m,M < n,N ; \u00f1,\u00d1 ; \u0144,\u0143 ; \u0146,\u0145 ; \u0148,\u0147 " +
         "< o,O ; \u00f2,\u00d2 ; \u00f3,\u00d3 ; \u00f4,\u00d4 ; \u00f5,\u00d5 ; \u00f6,\u00d6 ; \u00f8,\u00d8 " +
             "; \u014d,\u014c ; \u014f,\u014e ; \u0151,\u0150 " +
         "< p,P < q,Q < r,R ; \u0155,\u0154 ; \u0157,\u0156 ; \u0159,\u0158 " +
         "< s,S ; \u015b,\u015a ; \u015d,\u015c ; \u015f,\u015e ; \u0161,\u0160 " +
         "< t,T ; \u0163,\u0162 ; \u0165,\u0164 ; \u0167,\u0166 " +
         "< u,U ; \u00f9,\u00d9 ; \u00fa,\u00da ; \u00fb,\u00db ; \u00fc,\u00dc ; \u0169,\u0168 ; \u016b,\u016a ; \u016d,\u016c " +
             "; \u016f,\u016e ; \u0171,\u0170 ; \u0173,\u0172 " +
         "< v,V < w,W ; \u0175,\u0174 " +
         "< x,X < y,Y ; \u00fd,\u00dd ; \u00ff ; \u0177,\u0176 ; \u0178 " +
         "< z,Z ; \u017a,\u0179 ; \u017c,\u017b ; \u017e,\u017d");      

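    RuleBasedCollator col;
    try {
        col = new RuleBasedCollator(europeanRules);
    } catch (ParseException e) {
        // Fail loudly instead of continuing with a null collator
        throw new IllegalStateException("Invalid collation rules", e);
    }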
    col.setStrength(Collator.SECONDARY);
    col.setDecomposition(Collator.FULL_DECOMPOSITION);

    List<String> list = new ArrayList<String>();
    list.add("ac");
    list.add("äb");
    list.add("aa");
    list.add("bb");     
    Collections.sort(list,col);
    System.out.println(list);

\u00e4 (U+00E4) is the Unicode code point for ä, so as I understand it this should work. Or am I doing something wrong? Thanks in advance for any help.

Grzegorz asked Jan 11 '11 22:01


2 Answers

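The order you get is correct, at least according to the Wikipedia entry on this subject (sorry, it is in German; Google Translate might help, although it corrupts the umlauts for me...). In German dictionary sorting, ä is treated like a, and the umlaut only breaks ties between otherwise identical strings, so "äb" is compared as "ab" and lands before "ac".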
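A quick way to check this for yourself is to compare the strings at PRIMARY strength, where accent differences are ignored. The following is only a minimal sketch: the class and method names are made up for illustration, and the umlaut is written as \u00e4 so the result does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class GermanCollationDemo {

    // Sorts a copy of the input with the stock German collator.
    static List<String> sortGerman(List<String> input) {
        Collator german = Collator.getInstance(Locale.GERMAN);
        List<String> copy = new ArrayList<>(input);
        copy.sort(german);
        return copy;
    }

    // True when the German collator sees no difference between the two
    // strings once accents and case are ignored (PRIMARY strength).
    static boolean sameBaseLetters(String a, String b) {
        Collator primaryOnly = Collator.getInstance(Locale.GERMAN);
        primaryOnly.setStrength(Collator.PRIMARY);
        return primaryOnly.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // \u00e4 is the letter a with diaeresis (a-umlaut)
        System.out.println(sortGerman(Arrays.asList("ac", "\u00e4b", "aa", "bb")));
        System.out.println(sameBaseLetters("\u00e4b", "ab"));
    }
}
```

The first line printed should be the order from the question, [aa, äb, ac, bb], and the second should be true: to the German collator "äb" and "ab" share the same base letters, which is why "äb" ends up in front of "ac".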

the.duckman answered Sep 21 '22 06:09


If you want your accented characters to always come after the unaccented ones, you can prepend an @ to them in the rule string you pass to the RuleBasedCollator.

The RuleBasedCollator Javadoc defines the rule elements as follows:

[...]

Modifier: There are currently two modifiers that turn on special collation rules.

'@' : Turns on backwards sorting of accents (secondary differences), as in French.

'!' : Turns on Thai/Lao vowel-consonant swapping. If this rule is in force when a Thai vowel of the range \U0E40-\U0E44 precedes a Thai consonant of the range \U0E01-\U0E2E OR a Lao vowel of the range \U0EC0-\U0EC4 precedes a Lao consonant of the range \U0E81-\U0EAE then the vowel is placed after the consonant for collation purposes.

[...]

So your sample code would look as follows:

(I made the change only for the ä character, i.e. @\u00e4, @\u00c4)

 String europeanRules =
        ("< a,A ; \u00e0,\u00c0 ; \u00e1,\u00c1 ; \u00e2,\u00c2 ; \u00e3,\u00c3; @\u00e4,@\u00c4 ; \u00e5,\u00c5 ; \u00e6,\u00c6 "+
                "; \u0101,\u0100 ; \u0103,\u0102 ; \u0105,\u0104 " +       
         "< b,B < c,C ; \u00e7,\u00c7 ; \u0107,\u0106 ; \u0109,\u0108 ; \u010b,\u010a ; \u010d,\u010c " +
         "< d,D ; \u010f,\u010e ; \u0111,\u0110 " +
         "< e,E ; \u00e8,\u00c8 ; \u00e9,\u00c9 ; \u00ea,\u00ca ; \u00eb,\u00cb " +
             "; \u0113,\u0112 ; \u0115,\u0114 ; \u0117,\u0116 ; \u0119,\u0118 ; \u011b,\u011a " +
         "< f,F < g,G < h,H " +
         "< i,I ; \u00ec,\u00cc ; \u00ed,\u00cd ; \u00ee,\u00ce ; \u00ef,\u00cf " +
         "< j,J < k,K " +
         "< l,L ; \u013a,\u0139 ; \u013c,\u013b ; \u013e,\u013d ; \u0140,\u013f ; \u0142,\u0141 " +
         "< m,M < n,N ; \u00f1,\u00d1 ; \u0144,\u0143 ; \u0146,\u0145 ; \u0148,\u0147 " +
         "< o,O ; \u00f2,\u00d2 ; \u00f3,\u00d3 ; \u00f4,\u00d4 ; \u00f5,\u00d5 ; \u00f6,\u00d6 ; \u00f8,\u00d8 " +
             "; \u014d,\u014c ; \u014f,\u014e ; \u0151,\u0150 " +
         "< p,P < q,Q < r,R ; \u0155,\u0154 ; \u0157,\u0156 ; \u0159,\u0158 " +
         "< s,S ; \u015b,\u015a ; \u015d,\u015c ; \u015f,\u015e ; \u0161,\u0160 " +
         "< t,T ; \u0163,\u0162 ; \u0165,\u0164 ; \u0167,\u0166 " +
         "< u,U ; \u00f9,\u00d9 ; \u00fa,\u00da ; \u00fb,\u00db ; \u00fc,\u00dc ; \u0169,\u0168 ; \u016b,\u016a ; \u016d,\u016c " +
             "; \u016f,\u016e ; \u0171,\u0170 ; \u0173,\u0172 " +
         "< v,V < w,W ; \u0175,\u0174 " +
         "< x,X < y,Y ; \u00fd,\u00dd ; \u00ff ; \u0177,\u0176 ; \u0178 " +
         "< z,Z ; \u017a,\u0179 ; \u017c,\u017b ; \u017e,\u017d");      
    
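    RuleBasedCollator col;
    try {
        col = new RuleBasedCollator(europeanRules);
    } catch (ParseException e) {
        // Fail loudly instead of continuing with a null collator
        throw new IllegalStateException("Invalid collation rules", e);
    }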
    col.setStrength(Collator.SECONDARY);
    col.setDecomposition(Collator.FULL_DECOMPOSITION);
    
    List<String> list = new ArrayList<String>();
    list.add("ac");
    list.add("äb");
    list.add("aa");
    list.add("af");
    list.add("bb");     
    Collections.sort(list,col);
    System.out.println(list);

The output is:

[aa, ac, af, äb, bb]
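If the collation rules turn out to be hard to control, a plain Comparator is another way to get this order. Note that this is a different technique from the RuleBasedCollator approach and not part of the java.text API: it is only a minimal sketch that assumes each umlaut should count as its own letter directly after its base letter. The ALPHABET string, the class name and the helper methods are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class UmlautAfterBaseLetter {

    // Assumed alphabet: every umlaut is a separate letter placed right
    // after its base letter (a, a-umlaut, b, ..., o, o-umlaut, ...).
    static final String ALPHABET =
            "a\u00e4bcdefghijklmno\u00f6pqrs\u00dftu\u00fcvwxyz";

    // Position of a character in ALPHABET, ignoring case. Characters that
    // are not listed sort after all listed ones, by their char value.
    static int rank(char c) {
        int idx = ALPHABET.indexOf(Character.toLowerCase(c));
        return idx >= 0 ? idx : ALPHABET.length() + c;
    }

    // Compares two strings letter by letter using rank(); a string that is
    // a prefix of another sorts first. Case differences are ignored.
    static final Comparator<String> ORDER = (x, y) -> {
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            int diff = rank(x.charAt(i)) - rank(y.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        return x.length() - y.length();
    };

    // Returns a sorted copy, leaving the input list untouched.
    static List<String> sort(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // \u00e4 is a-umlaut; prints [aa, ac, af, äb, bb]
        System.out.println(sort(Arrays.asList("ac", "\u00e4b", "aa", "af", "bb")));
    }
}
```

The trade-off is that you lose everything the Collator does for you (decomposition of combining characters, locale rules for other scripts), so this only makes sense when the set of letters is small and known in advance.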
MicSim answered Sep 23 '22 06:09