Sorting UTF-8 strings in RoR

I am trying to figure out a 'proper' way of sorting UTF-8 strings in Ruby on Rails.

In my application, I have a select box that is populated with countries. As my application is localized, each existing locale has a countries.yml file that relates a country's id to the localized name for that country. I can't sort the strings manually in the yml file because I need the ID to be consistent across all locales.

What I have done is create an ascii_name method which uses the unidecode gem to convert accented and non-Latin characters to their ASCII equivalents (for instance, "Afeganistão" would become "Afeganistao"), and then sort on that:

require 'unidecode'

class Country
  def ascii_name
    Unidecoder.decode(name).gsub("[?]", "").gsub(/`/, "'").strip
  end
end

Country.all.sort_by(&:ascii_name)

However, there are obvious issues with this:

  • It cannot properly sort non-latin locales, as there may not be a direct analogous latin character.
  • It makes no distinction between a letter and all accented forms of that letter (so, for instance, A and Ä become interchangeable)
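
The second issue can be reduced with a tie-breaker. The following is a minimal sketch, not the unidecode approach above: it assumes Ruby 2.2+ (for String#unicode_normalize), and the helper names base_letters and accent_aware_key are made up for illustration. It does nothing for the first issue (non-Latin scripts).

```ruby
# Strips combining marks after canonical decomposition, so "Ä" becomes "A".
# (Hypothetical helper; stands in for the transliteration step.)
def base_letters(name)
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Primary key: the unaccented form. Secondary key: the original string,
# used only to break ties between names with the same base letters.
def accent_aware_key(name)
  [base_letters(name), name]
end

names = ["Äthiopien", "Albanien", "A", "Ä", "Afghanistan"]
sorted_names = names.sort_by { |name| accent_aware_key(name) }
# => ["A", "Ä", "Afghanistan", "Albanien", "Äthiopien"]
```

With the two-element key, "A" and "Ä" are no longer interchangeable: the plain letter consistently sorts first.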

Does anyone know of a better way that I could sort my strings?

Daniel Vandersluis asked Jun 11 '09 18:06


4 Answers

Ruby performs string comparisons based on the byte values of characters:

%w[à a e].sort
# => ["a", "e", "à"]
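
To see why "à" lands after "e", it can help to look at the bytes themselves. A small sketch using only core String methods (the variable names are just for illustration):

```ruby
a_grave = "\u00E0" # "à", written as an escape to make the code point explicit
e       = "e"

a_grave_bytes = a_grave.bytes # two bytes in UTF-8: [195, 160]
e_bytes       = e.bytes       # a single byte: [101]

# String#<=> compares byte by byte, and 101 < 195, so "e" sorts before "à".
comparison = e <=> a_grave # => -1
```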

To properly collate strings according to locale, the ffi-icu gem could be used:

require "ffi-icu"

ICU::Collation.collate("it_IT", %w[à a e])
# => ["a", "à", "e"]

ICU::Collation.collate("de", %w[a s x ß])
# => ["a", "s", "ß", "x"]

As an alternative:

collator = ICU::Collation::Collator.new("it_IT")
%w[à a e].sort { |a, b| collator.compare(a, b) }
# => ["a", "à", "e"]
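
For the question's case of sorting records rather than bare strings, the same comparison can be applied to a name attribute. This is only a sketch: the Country struct below is a hypothetical stand-in for the real model, and sort_by_name is an invented helper that accepts any object responding to compare(a, b), which is how the collator is used above.

```ruby
# Hypothetical stand-in for the question's Country model.
Country = Struct.new(:id, :name)

# Sorts records by name using any collator-like object whose compare(a, b)
# returns a negative number, zero, or a positive number.
def sort_by_name(records, collator)
  records.sort { |a, b| collator.compare(a.name, b.name) }
end

# With ffi-icu installed, the call would look roughly like this:
#   require "ffi-icu"
#   collator = ICU::Collation::Collator.new("de")
#   sort_by_name(countries, collator)
```

Creating the collator once and reusing it avoids rebuilding it on every comparison.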

Update: To test how strings should collate according to locale rules, the ICU project provides this nice tool.

toro2k answered Oct 20 '22 10:10


http://github.com/grosser/sort_alphabetical

This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.

İ. Emre Kutlu answered Oct 20 '22 10:10


The only solution I have found thus far is to use ActiveSupport::Inflector.transliterate(string) to replace the Unicode characters with ASCII ones and then sort:

Country.all.sort_by do |country|
  ActiveSupport::Inflector.transliterate country.name
end

The only remaining problem is that this treats "ä" as "a" (DIN 5007-1), so I end up with "Ägypten" before "Albanien", while I would expect it to be the other way around. Thankfully, the replacement used for each character is configurable.

See documentation: http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate
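
To illustrate how a different replacement table changes the resulting order, here is a rough sketch in plain Ruby. It uses String#gsub rather than ActiveSupport's transliterate, and the EXPANSIONS table and expanded_sort_key helper are assumptions made for this example, in the style of DIN 5007-2 (where "ö" is treated as "oe").

```ruby
# Hypothetical replacement table in the style of DIN 5007-2.
EXPANSIONS = {
  "ä" => "ae", "ö" => "oe", "ü" => "ue",
  "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue",
  "ß" => "ss"
}.freeze

# Builds a case-insensitive sort key with the umlauts expanded.
def expanded_sort_key(name)
  name.gsub(/[äöüÄÖÜß]/, EXPANSIONS).downcase
end

names  = ["Oman", "Österreich", "Norwegen"]
sorted = names.sort_by { |name| expanded_sort_key(name) }
# => ["Norwegen", "Österreich", "Oman"]
```

With a plain "ö" to "o" replacement, "Österreich" would instead sort after "Oman", so the choice of table directly decides the order.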

Kostas answered Oct 20 '22 11:10


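There are a couple of ways to go. You may want to convert the UTF-8 strings to hex strings of their code points and then sort those:

# Zero-pad each code point to six hex digits so that string order matches code point order
s.unpack('U*').map { |code_point| code_point.to_s(16).rjust(6, '0') }.join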

Or you can use the iconv library. Read up on it and use it as appropriate (from dzone):

# Add this to environment.rb.
# Call to_iso on any UTF-8 string to get an ISO-8859-1 string back.
# Example: "Cédez le passage aux français".to_iso

class String
  require 'iconv' # this line is not needed in Rails
  def to_iso
    Iconv.conv('ISO-8859-1', 'utf-8', self)
  end
end
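
Note that Iconv was removed from Ruby's standard library in 2.0. On Ruby 1.9 and later, the same conversion can be sketched with the built-in String#encode, which needs no require (the variable names are just for illustration):

```ruby
utf8 = "Cédez le passage aux français"

# Transcode from UTF-8 to ISO-8859-1 (Latin-1).
iso = utf8.encode("ISO-8859-1")

iso_encoding = iso.encoding # Encoding::ISO_8859_1
iso_bytesize = iso.bytesize # one byte per character in Latin-1

# Characters with no Latin-1 equivalent raise
# Encoding::UndefinedConversionError unless a replacement is given,
# e.g. utf8.encode("ISO-8859-1", undef: :replace).
```

As with the Iconv version, this only helps for text that fits in Latin-1, and the resulting byte order is still not a locale-aware collation.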

Ryan Oberoi answered Oct 20 '22 10:10