Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort Macedonian alphabet using collation

Tags:

java

collation

I'm trying to sort a set of strings written in Macedonian alphabet. I know how to do it, but the end result isn't what I expected. Here is my test program:

public class Main {

    private static final char[] ALPHABET_ARRAY = {
        'а', 'б', 'в', 'г', 'д', 'ѓ', 'е', 'ж', 'з', 'ѕ', 'и', 'ј', 'к', 'л', 'љ', 'м', 'н', 'њ', 'о', 'п', 'р', 'с', 'т', 'ќ', 'у', 'ф', 'х', 'ц', 'ч','џ', 'ш' };

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("mk", "MK"));
        List<String> list = new LinkedList<>();
        for (int i = 0; i < ALPHABET_ARRAY.length; i++) {
            list.add("" + ALPHABET_ARRAY[i]);
        }
        list.sort(collator::compare);
        list.forEach(System.out::print);
    }
}

The letters in ALPHABET_ARRAY are in the correct alphabetical order, but the program prints

абвгѓдежзѕијкќлљмнњопрстуфхцчџш

But it should have been:

абвгдѓежзѕијклљмнњопрстќуфхцчџш

Is there a problem with the Macedonian collator in Java or am I doing something wrong?

like image 828
David ten Hove Avatar asked Jun 04 '16 10:06

David ten Hove


1 Answers

Collator for "mk_MK" locale is based on the sun.text.resources.mk.CollationData_mk resource (CollationData_mk.java source in jdk8u repo tagged jdk8u92-b14).

The collator rules in CollationData_mk clearly place 'ѓ' right after 'г' and 'ќ' right after 'к'.

As it is possible to create RuleBasedCollator with custom rules, the simplest way to get the sorting order that you need is to modify the rules from CollationData_mk a little:

public static Collator createMacedonianCollator() throws ParseException {
    // the defaults are defined in non-public sun.util.locale.provider.CollationRules
    // they are used internally in sun.util.locale.provider.CollatorProviderImpl
    // we have no direct access to proper defaults, so we will simply comment entries which depend on them
    String DEFAULTRULES = "";
    // we will move the entries for ѓ and ќ only, leaving everything else as is
    return new RuleBasedCollator( DEFAULTRULES +
            //"& 9 < \u0482 " +       // thousand sign
            //"& Z " +                // Arabic script sorts after Z's
            "< \u0430 , \u0410" +   // a
            "< \u0431 , \u0411" +   // be
            "< \u0432 , \u0412" +   // ve
            "< \u0433 , \u0413" +   // ghe
            "; \u0491 , \u0490" +   // ghe-upturn
            "; \u0495 , \u0494" +   // ghe-mid-hook
            /*!!!moved after д/de!!!*/ //"; \u0453 , \u0403" +   // gje
            "; \u0493 , \u0492" +   // ghe-stroke
            "< \u0434 , \u0414" +   // de
            /*!!!moved AND relation strength changed!!!*/ "< \u0453 , \u0403" +   // gje
            "< \u0452 , \u0402" +   // dje
            "< \u0435 , \u0415" +   // ie
            "; \u04bd , \u04bc" +   // che
            "; \u0451 , \u0401" +   // io
            "; \u04bf , \u04be" +   // che-descender
            "< \u0454 , \u0404" +   // uk ie
            "< \u0436 , \u0416" +   // zhe
            "; \u0497 , \u0496" +   // zhe-descender
            "; \u04c2 , \u04c1" +   // zhe-breve
            "< \u0437 , \u0417" +   // ze
            "; \u0499 , \u0498" +   // zh-descender
            "< \u0455 , \u0405" +   // dze
            "< \u0438 , \u0418" +   // i
            "< \u0456 , \u0406" +   // uk/bg i
            "; \u04c0 " +           // palochka
            "< \u0457 , \u0407" +   // uk yi
            "< \u0439 , \u0419" +   // short i
            "< \u0458 , \u0408" +   // je
            "< \u043a , \u041a" +   // ka
            "; \u049f , \u049e" +   // ka-stroke
            "; \u04c4 , \u04c3" +   // ka-hook
            "; \u049d , \u049c" +   // ka-vt-stroke
            "; \u04a1 , \u04a0" +   // bashkir-ka
            /*!!!moved after т/te!!!*/ //"; \u045c , \u040c" +   // kje
            "; \u049b , \u049a" +   // ka-descender
            "< \u043b , \u041b" +   // el
            "< \u0459 , \u0409" +   // lje
            "< \u043c , \u041c" +   // em
            "< \u043d , \u041d" +   // en
            "; \u0463 " +           // yat
            "; \u04a3 , \u04a2" +   // en-descender
            "; \u04a5 , \u04a4" +   // en-ghe
            "; \u04bb , \u04ba" +   // shha
            "; \u04c8 , \u04c7" +   // en-hook
            "< \u045a , \u040a" +   // nje
            "< \u043e , \u041e" +   // o
            "; \u04a9 , \u04a8" +   // ha
            "< \u043f , \u041f" +   // pe
            "; \u04a7 , \u04a6" +   // pe-mid-hook
            "< \u0440 , \u0420" +   // er
            "< \u0441 , \u0421" +   // es
            "; \u04ab , \u04aa" +   // es-descender
            "< \u0442 , \u0422" +   // te
            "; \u04ad , \u04ac" +   // te-descender
            "< \u045b , \u040b" +   // tshe
            /*!!!movedAND relation strength changed!!!*/ "< \u045c , \u040c" +   // kje
            "< \u0443 , \u0423" +   // u
            "; \u04af , \u04ae" +   // straight u
            "< \u045e , \u040e" +   // short u
            "< \u04b1 , \u04b0" +   // straight u-stroke
            "< \u0444 , \u0424" +   // ef
            "< \u0445 , \u0425" +   // ha
            "; \u04b3 , \u04b2" +   // ha-descender
            "< \u0446 , \u0426" +   // tse
            "; \u04b5 , \u04b4" +   // te tse
            "< \u0447 , \u0427" +   // che
            "; \u04b7 ; \u04b6" +   // che-descender
            "; \u04b9 , \u04b8" +   // che-vt-stroke
            "; \u04cc , \u04cb" +   // che
            "< \u045f , \u040f" +   // dzhe
            "< \u0448 , \u0428" +   // sha
            "< \u0449 , \u0429" +   // shcha
            "< \u044a , \u042a" +   // hard sign
            "< \u044b , \u042b" +   // yeru
            "< \u044c , \u042c" +   // soft sign
            "< \u044d , \u042d" +   // e
            "< \u044e , \u042e" +   // yu
            "< \u044f , \u042f" +   // ya
            "< \u0461 , \u0460" +   // omega
            "< \u0462 " +           // yat
            "< \u0465 , \u0464" +   // iotified e
            "< \u0467 , \u0466" +   // little yus
            "< \u0469 , \u0468" +   // iotified little yus
            "< \u046b , \u046a" +   // big yus
            "< \u046d , \u046c" +   // iotified big yus
            "< \u046f , \u046e" +   // ksi
            "< \u0471 , \u0470" +   // psi
            "< \u0473 , \u0472" +   // fita
            "< \u0475 , \u0474" +   // izhitsa
            "; \u0477 , \u0476" +   // izhitsa-double-grave
            "< \u0479 , \u0478" +   // uk
            "< \u047b , \u047a" +   // round omega
            "< \u047d , \u047c" +   // omega-titlo
            "< \u047f , \u047e" +   // ot
            "< \u0481 , \u0480"     // koppa
    );
}

The rules can be simplified further to contain only the base 31 letters without accented variants.

like image 141
Oleg Estekhin Avatar answered Oct 22 '22 18:10

Oleg Estekhin