
Sort values using a specific collation in Ruby/Rails

Is it possible to sort an array of values using a specific collation in Ruby? I have a need to sort according to the da_DK collation.

Given the array %w(Aarhus Aalborg Assens) I would like to have ['Assens', 'Aalborg', 'Aarhus'] back which is the correct order in Danish.

The standard sort method

%w(Aarhus Aalborg Assens).sort

returns what looks like ASCII order, which is not the Danish order:

["Aalborg", "Aarhus", "Assens"]

The environment is both Snow Leopard and Linux, running Ruby 1.9.2 and Rails 3.0.5.

HakonB asked Mar 28 '11 22:03


2 Answers

According to Wikipedia:

"In the Danish and Norwegian alphabets, the same extra vowels as in Swedish (see below) are also present but in a different order and with different glyphs (..., X, Y, Z, Æ, Ø, Å). Also, "Aa" collates as an equivalent to "Å". The Danish alphabet has traditionally seen "W" as a variant of "V", but today "W" is considered a separate letter."

This would throw off sorting.

Do this to fix the problem:

names = %w(Aarhus Aalborg Assens)
names.sort_by { |w| w.gsub('Aa', 'Å') } # => ["Assens", "Aalborg", "Aarhus"]

Do something similar for any other compound letter combinations, converting each one to its single-character equivalent.
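As a rough sketch of how that could be extended, the helper below builds a sort key that also puts Æ, Ø and Å after Z, which plain code-point order does not do. The name danish_sort_key is hypothetical (it is not part of Ruby or Rails), and the approach is only an approximation of Danish collation: it treats every "aa" as "å" and ignores accents and other tailoring rules that a real collation library handles.

```ruby
# Hypothetical helper: approximate Danish sort key (not a full collation).
def danish_sort_key(word)
  key = word.gsub(/aa/i, 'å')        # "Aa"/"aa" collates like "Å"
  key = key.tr('ÆØÅæøå', '{|}{|}')   # "{", "|", "}" follow "z" in ASCII, giving Æ < Ø < Å
  key.downcase                       # compare the remaining ASCII letters case-insensitively
end

towns = %w(Aarhus Ølstykke Viborg Aalborg Ærø Assens Odense)
sorted_towns = towns.sort_by { |town| danish_sort_key(town) }
# => ["Assens", "Odense", "Viborg", "Ærø", "Ølstykke", "Aalborg", "Aarhus"]
```

On Ruby 1.9 the source file would need a UTF-8 encoding comment on its first line for the non-ASCII literals to parse.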

This works because sort_by performs a Schwartzian transform: it sorts by the value the block returns, which in this case is the name with 'Aa' replaced by 'Å'. The replacement is temporary and is discarded once the array is sorted.
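To make the idea concrete, here is the same decorate, sort, undecorate sequence written out by hand. It is only an illustration of the technique, not how sort_by is implemented internally; the variable names are made up for the example.

```ruby
names = %w(Aarhus Aalborg Assens)

# 1. Decorate: pair each name with its temporary sort key.
decorated = names.map { |name| [name.gsub('Aa', 'Å'), name] }

# 2. Sort the pairs. Arrays compare element by element, so the key decides.
sorted_pairs = decorated.sort

# 3. Undecorate: drop the keys and keep the original names.
by_hand = sorted_pairs.map { |_key, name| name }

# The one-liner from the answer gives the same result.
by_sort_by = names.sort_by { |name| name.gsub('Aa', 'Å') }
```

Both by_hand and by_sort_by hold ["Assens", "Aalborg", "Aarhus"], and names itself is left unchanged.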

sort_by is very powerful, but it has some overhead. For a simple sort you should use sort because it's faster. For sorts that compare two simple values at the top level of an object, it is a wash whether you use sort or sort_by. If you have to do more complex calculations or dig around inside an object, sort_by can prove to be faster. There is no hard-and-fast rule for which is better, so I strongly recommend benchmarking if you have to sort large arrays or deal with objects: the difference can be large, and sometimes sort is the better choice.
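A minimal sketch of such a benchmark, using the Benchmark module from Ruby's standard library. The word list is randomly generated test data and the array size is arbitrary; timings will vary by machine and Ruby version, so treat the output as a starting point for your own measurements.

```ruby
require 'benchmark'

# Made-up test data: 5,000 random 8-letter lowercase words (fixed seed for repeatability).
rng = Random.new(42)
words = Array.new(5_000) { Array.new(8) { rng.rand(97..122).chr }.join }

# The same key computation is used by both approaches.
key = lambda { |word| word.gsub('aa', 'å') }

with_sort = nil
with_sort_by = nil

Benchmark.bm(8) do |bm|
  # sort recomputes the key for both operands on every comparison.
  bm.report('sort')    { with_sort = words.sort { |a, b| key.call(a) <=> key.call(b) } }
  # sort_by computes the key once per element.
  bm.report('sort_by') { with_sort_by = words.sort_by { |word| key.call(word) } }
end
```

Both variants produce the same ordering; only the time spent computing keys differs.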

EDIT:

Ruby, by itself, isn't going to do what you want, because it has no knowledge of the sort order of every character set out there. There's a discussion regarding incorporating IBM's ICU that explains why that is. If you want ICU's abilities, you could look into ICU4R. I haven't played with it, but it sounds like your only real solution in Ruby.

You might be able to do something with a database like Postgres. They support various collating options but usually force you to declare the collation when you create the database... or maybe it's when the table is created... it's been a while since I created a new table. Anyway, that'd be an option, though it would be a pain.

the Tin Man answered Oct 12 '22 04:10


I found ffi-locale on GitHub, and as far as I can see it solves my problem.

It allows the following code:

FFILocale::setlocale FFILocale::LC_COLLATE, 'da_DK.UTF-8'
%w(Aarhus Aalborg Assens).sort { |a,b| FFILocale::strcoll(a, b) }

Which returns the correct result:

=> ["Assens", "Aalborg", "Aarhus"]

I haven't investigated performance yet, but it calls out to native code, so it ought to be faster than Ruby character-replacement code...

Update
It is not perfect :( It does not work properly on Snow Leopard: the strcoll function seems to be broken on OS X and has been for some time. That is annoying, but my main deployment platform is Linux, where it works, so it is currently my preferred solution.

HakonB answered Oct 12 '22 04:10