
Locale based sort in Javascript, sort accented letters and other variants in a predefined way

In Finnish, W is sorted after V (as in English), but because W is not a native Finnish letter it is treated as a variant of V: it sorts as if it were equal to V, except that when two words differ only in V versus W, the V version comes first. An example illustrates the proper order:

Vatanen, Watanen, Virtanen

In Finnish, V and W are collated in the same way as A and Á: Á is sorted like A, but when the accent is the only difference, the unaccented form comes first. The same rule applies to all other accented letters, except that Å, Ä and Ö are collated as separate letters after Z.

Question: What would be the best algorithm for sorting variants like these in a predefined way (e.g. [Watanen, Vatanen, Virtanen] to [Vatanen, Watanen, Virtanen])?

Addition: It makes sense to extend the question to the other variants defined in http://cldr.unicode.org/index/cldr-spec/collation-guidelines, because the technique would very probably be the same. That way the answers benefit the widest possible audience, and sort algorithms can be made compatible with the collation rules defined in Unicode CLDR. Unicode CLDR defines three levels of difference between letters: primary level (base letters), secondary level (accented letters) and tertiary level (character case).
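To make the three levels concrete, here is a minimal sketch of a hand-written multi-level comparator. The names primaryKey and compareFinnishSketch are hypothetical, and the folding rules cover only the V/W case, Latin accents and Å, Ä, Ö; it is an illustration of the idea, not an implementation of the CLDR rules for Finnish.

```javascript
// Primary key: lower-case, fold w to v and strip accents. å, ä and ö are
// separate Finnish letters after z, so they are mapped to '{', '|' and '}',
// the three ASCII characters that follow 'z' (assumes the input words do
// not contain those characters themselves).
function primaryKey(word) {
  return word
    .toLowerCase()
    .replace(/w/g, 'v')
    .normalize('NFD')                 // split letters from combining marks
    .replace(/a\u030A/g, '{')         // å
    .replace(/a\u0308/g, '|')         // ä
    .replace(/o\u0308/g, '}')         // ö
    .replace(/[\u0300-\u036f]/g, ''); // drop all remaining accents
}

// Plain code-unit comparison of two strings.
function compareCodeUnits(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

function compareFinnishSketch(a, b) {
  // Primary level: base letters only.
  const primary = compareCodeUnits(primaryKey(a), primaryKey(b));
  if (primary !== 0) return primary;

  // Secondary level: base form before variant (v before w, a before á).
  // In NFD a combining mark has a higher code unit than any ASCII letter.
  const secondary = compareCodeUnits(
    a.toLowerCase().normalize('NFD'),
    b.toLowerCase().normalize('NFD')
  );
  if (secondary !== 0) return secondary;

  // Tertiary level: lower case before upper case, so the code-unit
  // order is reversed here.
  return compareCodeUnits(b, a);
}

console.log(['Watanen', 'Vatanen', 'Virtanen'].sort(compareFinnishSketch));
// → [ 'Vatanen', 'Watanen', 'Virtanen' ]
```

Each level is consulted only when the previous one reports a tie, which is what lets Watanen sort next to Vatanen instead of after Virtanen.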

I have thought of some kind of array preparation, as in a numerical sort where all numbers are padded with zeros to make them comparable as strings. For example, the array [file1000.jpg, file3.jpg, file22.jpg] can be made comparable as strings by padding with zeros: [file1000.jpg, file0003.jpg, file0022.jpg]. Once the array has been prepared, it can be sorted very fast with the native Array.sort().

The target language is JavaScript, which lacks support for collation-based sorting, so a custom sort function has to be written by hand. An algorithm is preferred, but code is worth a +1 as well.
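The zero-padding preparation described above can be sketched as follows. The helper names padNumbers and sortNaturally, and the default width of 10 digits, are assumptions made for illustration; the width only needs to be at least as long as the longest number in the data.

```javascript
// Pad every run of digits to a fixed width so that plain string
// comparison orders the numbers numerically.
function padNumbers(name, width = 10) {
  return name.replace(/\d+/g, (digits) => digits.padStart(width, '0'));
}

// Compute each key once, sort on the keys, then return the original names.
function sortNaturally(names) {
  return names
    .map((name) => ({ name, key: padNumbers(name) }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.name);
}

console.log(sortNaturally(['file1000.jpg', 'file3.jpg', 'file22.jpg']));
// → [ 'file3.jpg', 'file22.jpg', 'file1000.jpg' ]
```

Keeping the padded key next to the original name means the output still contains the unpadded file names, and the same pattern works for any precomputed sort key, including a collation key.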

Timo Kähkönen asked Sep 27 '12 15:09


1 Answer

Since the time you originally asked this question, JavaScript is finally acquiring some decent locale support, including for collation.

Read up on Intl, the ECMAScript Internationalization API (standardized as ECMA-402, alongside rather than inside ECMAScript 6 / Harmony), and specifically on Intl.Collator.
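As a small sketch of how Intl.Collator relates to the primary, secondary and tertiary levels from the question, the sensitivity option selects which differences count. The English locale is used here only so that the example does not depend on Finnish locale data being installed; that choice is an assumption of this sketch, not part of the original answer.

```javascript
// 'base'    : only base letters differ (primary level)
// 'accent'  : base letters and accents differ (up to secondary level)
// 'variant' : base letters, accents and case all differ
const base = new Intl.Collator('en', { sensitivity: 'base' });
const accent = new Intl.Collator('en', { sensitivity: 'accent' });
const variant = new Intl.Collator('en', { sensitivity: 'variant' });

console.log(base.compare('a', 'á') === 0);    // accents ignored
console.log(base.compare('a', 'A') === 0);    // case ignored
console.log(accent.compare('a', 'á') !== 0);  // accents distinguished
console.log(accent.compare('a', 'A') === 0);  // case still ignored
console.log(variant.compare('a', 'A') !== 0); // case distinguished
```

For sorting, the default sensitivity is 'variant', so ties on the base letters are broken by accents and then by case.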

The documentation doesn't actually make it very clear that both modern and traditional sort orders are supported for Finnish, but I've tried and they are.

To get a collator for the traditional order you need to pass a "fancy" language code string: fi-u-co-trad. For the "reformed" sort order there is fi-u-co-reformed. This breaks down as:

  • fi - ISO 639 language code for Finnish.
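  • u - introduces the Unicode locale extension of the language tag; the key/value subtags after it set locale options. (not well documented)
  • co - the collation key; the subtag that follows it names the collation type.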
  • trad - traditional sort order. I read about this option for Spanish and only found it works for Finnish as well by testing. (not well documented)
  • reformed - reformed sort order. Seems to be an antonym for 'trad'. If you specify neither trad nor reformed you will get default, which may be trad on some browsers and reformed on others.
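Because the default differs between browsers and an unsupported collation type is not reported as an error, it can help to check what the engine actually resolved. The following is a minimal sketch; the helper name resolveCollation is hypothetical, and the values it reports depend on the locale data shipped with the engine.

```javascript
// Build a tag such as 'fi-u-co-trad' and report what the engine resolved.
function resolveCollation(language, collation) {
  const collator = new Intl.Collator(language + '-u-co-' + collation);
  const resolved = collator.resolvedOptions();
  return {
    requested: collation,
    locale: resolved.locale,       // locale actually in use
    collation: resolved.collation, // collation type actually in use
    honored: resolved.collation === collation,
  };
}

for (const type of ['trad', 'reformed']) {
  console.log(resolveCollation('fi', type));
}
```

If honored is false, the collator fell back to another collation type for that locale, so its results should not be assumed to follow the requested order.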

The code:

var surnames = ['Watanen', 'Vatanen', 'Virtanen'];

var traColl = new Intl.Collator('fi-u-co-trad');
var refColl = new Intl.Collator('fi-u-co-reformed');
var defColl = new Intl.Collator('fi');

// resolvedOptions() is the standardized way to inspect a collator; the
// `resolved` property used in the original snippet is not part of the standard.
// The locale and collation it reports depend on the engine's locale data.
function describe(coll) {
  var opts = coll.resolvedOptions();
  return opts.locale + ' -> ' + opts.collation;
}

// slice() copies the array so that each sort starts from the original order.
console.log('traditional:', describe(traColl), surnames.slice().sort(traColl.compare));
console.log('reformed:', describe(refColl), surnames.slice().sort(refColl.compare));
console.log('default:', describe(defColl), surnames.slice().sort(defColl.compare));

Output (as observed in Google Chrome when this answer was written):

traditional: fi-u-co-trad -> trad ["Vatanen", "Watanen", "Virtanen"]
reformed: fi-u-co-reformed -> reformed ["Vatanen", "Virtanen", "Watanen"]
default: fi -> default ["Vatanen", "Virtanen", "Watanen"]

Tested in Google Chrome, which, from what I read online, is lagging behind Firefox in this stuff.

hippietrail answered Sep 22 '22 15:09