
localeCompare shows inconsistent behavior when sorting words with leading umlaut characters

Tested in latest Firefox and Chrome (which have a 'de' locale on my system):

"Ä".localeCompare("A")

gives me 1, meaning that it believes "Ä" should appear after "A" in a sorted order, which is correct.

But:

"Ägypten".localeCompare("Algerien")

gives me -1, meaning that it believes "Ägypten" should appear before "Algerien" in a sorted order.

Why? Why does it look past the first character of each string, if it says that the first character of the first string should appear after the first character of the second string when you check it on its own?

grssnbchr asked Mar 10 '15 14:03


2 Answers

Here is a method for exactly this need; copy and paste it:

It walks through the strings recursively and returns the locale comparison result of individual characters rather than of the whole strings :)

FINAL RESULT: bug fixed (incorrect stoppage or recursive loop), and a comparison of the entire strings was added:

String.prototype.MylocaleCompare = function (right, idx){
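    // The original `idx++` here was a post-increment, so idx kept the value passed in
    // (0 in practice); the recursion ends because the strings get shorter.
    idx = (idx === undefined) ? 0 : idx;

    // Keep recursing only while the shorter string has more than one character left.
    var run = idx < Math.min(this.length, right.length) - 1;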


    if (!run)
    {
        if (this[0].localeCompare(right[0]) == 0)
        {
            return this.localeCompare(right);
        }
        else
        {
            return this[0].localeCompare(right[0]);
        }
    }

    if(this.localeCompare(right) != this[0].localeCompare(right[0]))
    {
        var myLeft = this.slice(1, this.length);
        var myRight = right.slice(1, right.length);
        if (myLeft.localeCompare(myRight) != myLeft[0].localeCompare(myRight[0]))
        {
            return myLeft.MylocaleCompare(myRight, idx);
        }
        else
        {
            if (this[0].localeCompare(right[0]) == 0)
            {
                return myLeft.MylocaleCompare(myRight, idx);
            }
            else
            {
                return this[0].localeCompare(right[0])
            }
        }
    }
    else
    {
        return this.localeCompare(right);
    }

}
SilentTremor answered Nov 14 '22 22:11


http://en.wikipedia.org/wiki/Diaeresis_(diacritic)#Printing_conventions_in_German

“When alphabetically sorting German words, the umlaut is usually not distinguished from the underlying vowel, although if two words differ only by an umlaut, the umlauted one comes second […]
“There is a second system in limited use, mostly for sorting names (colloquially called "telephone directory sorting"), which treats ü like ue, and so on.”

Assuming the second kind of sorting algorithm is applied, then the results you are seeing make sense.

Ä would become Ae, and that is “longer” than your other value A, so sorting A before Ae, and therefore A before Ä, would be correct (as you said yourself, you consider this to be correct; and it would also be correct under the first algorithm, which just treats Ä as A).

Now Ägypten becomes Aegypten for sorting purposes, and therefore it has to appear before Algerien under the same sorting logic: the first letters of both terms are equal, so it is up to the second ones to determine sort order, and e has a lexicographically lower sort value than l. Therefore, Aegypten comes before Algerien, meaning Ägypten comes before Algerien.
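To make the reasoning above concrete, here is a minimal sketch of that "telephone directory" logic. The names `PHONEBOOK_MAP`, `phonebookKey` and `phonebookCompare` are made up for illustration, and this is not what browsers actually run internally; it only expands the umlauts and then compares the results by plain code-unit order. It assumes precomposed characters (Ä as a single code point) and does not handle case differences.

```javascript
// Hypothetical mapping for illustration: expand each umlaut as phonebook sorting does.
const PHONEBOOK_MAP = { "ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue" };

// Build the sort key, e.g. "Ägypten" -> "Aegypten".
function phonebookKey(word) {
  return word.replace(/[äöüÄÖÜ]/g, (ch) => PHONEBOOK_MAP[ch]);
}

// Compare the expanded keys by plain code-unit order (no locale data involved).
function phonebookCompare(a, b) {
  const ka = phonebookKey(a);
  const kb = phonebookKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
}

console.log(phonebookKey("Ägypten"));                 // "Aegypten"
console.log(phonebookCompare("Ä", "A"));              // 1  ("Ae" is longer than "A")
console.log(phonebookCompare("Ägypten", "Algerien")); // -1 ("e" sorts before "l")
```

Both results match what the question reports for localeCompare.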


German Wikipedia elaborates even more about this (http://de.wikipedia.org/wiki/Alphabetische_Sortierung#Einsortierungsregeln_f.C3.BCr_weitere_Buchstaben), and notes that there are two variants of the relevant DIN 5007.

DIN 5007 variant 1 says that ä is to be treated as a, ö as o and ü as u, and that this kind of sorting is to be used for dictionaries and the like.

DIN 5007 variant 2 says the other thing: ä is to be treated as ae, etc., and this is to be used mainly for name listings such as telephone books.

Wikipedia goes on to say that this takes into account that there might be more than one form of spelling for personal names (someone’s last name might be Moeller or Möller, both versions exist), whereas for words in a dictionary there is usually only one spelling that is considered correct.
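For comparison, the dictionary-style rule (umlaut treated as the base vowel, umlauted word second when that is the only difference) could be sketched like this. Again, `BASE_VOWEL`, `dictionaryKey` and `dictionaryCompare` are hypothetical names, the comparison is by plain code-unit order, and precomposed characters are assumed.

```javascript
// Hypothetical mapping for illustration: reduce each umlaut to its base vowel.
const BASE_VOWEL = { "ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U" };

// Build the sort key, e.g. "Möller" -> "Moller".
function dictionaryKey(word) {
  return word.replace(/[äöüÄÖÜ]/g, (ch) => BASE_VOWEL[ch]);
}

function dictionaryCompare(a, b) {
  const ka = dictionaryKey(a);
  const kb = dictionaryKey(b);
  if (ka !== kb) return ka < kb ? -1 : 1;
  // Keys are equal, so the words differ only by umlauts (or not at all).
  // Plain vowels have lower code units than umlauts, so the umlauted word comes second.
  return a < b ? -1 : a > b ? 1 : 0;
}

console.log(dictionaryCompare("Ä", "A"));              // 1  (tie on the key, umlaut second)
console.log(dictionaryCompare("Ägypten", "Algerien")); // -1 ("g" sorts before "l")
console.log(dictionaryCompare("Möller", "Moeller"));   // 1  ("Moller" sorts after "Moeller")
```

In this sketch Möller and Moeller end up apart, which is the situation the telephone-book variant avoids for names. For the two words from the question, this variant gives the same order as the other one.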


Now, I guess the big question remaining is: Can I get browsers to apply the other form of sorting for the German locale? To be frank, I don’t know.

It would surely be desirable to be able to choose between those two forms of sorting because, as Wikipedia says, the personal names Moeller and Möller both exist, but there is only Ägypten and not Aegypten when it comes to a dictionary.
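One thing that may be worth trying, offered as a sketch rather than a confirmed answer: the ECMAScript Internationalization API lets a locale tag carry a collation request through the Unicode extension key `co`, and `phonebk` is the value for German phonebook order. Whether it takes effect depends on the locale data the engine ships, so the sketch checks `resolvedOptions()` instead of assuming it. The variable names are only for illustration.

```javascript
// Assumption: the runtime's locale data includes the German "phonebk" collation.
// If it does not, the engine falls back to its default German rules.
const dictionaryCollator = new Intl.Collator("de");
const phonebookCollator = new Intl.Collator("de-u-co-phonebk");

// Reports the collation that was actually selected for each collator.
console.log(dictionaryCollator.resolvedOptions().collation);
console.log(phonebookCollator.resolvedOptions().collation);

// Sort the same names with both collators and compare the results by eye.
const names = ["Möller", "Moeller", "Mueller", "Müller", "Muller"];
const byDictionary = [...names].sort(dictionaryCollator.compare);
const byPhonebook = [...names].sort(phonebookCollator.compare);
console.log(byDictionary);
console.log(byPhonebook);
```

The same locale tag can also be passed as the second argument of localeCompare, e.g. `"Ägypten".localeCompare("Algerien", "de-u-co-phonebk")`.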

CBroe answered Nov 15 '22 00:11