I am trying to figure out a 'proper' way of sorting UTF-8 strings in Ruby on Rails.
In my application, I have a select box that is populated with countries. As my application is localized, each existing locale has a countries.yml file that relates a country's id to the localized name for that country. I can't sort the strings manually in the yml file because I need the ID to be consistent across all locales.
What I have done is create an ascii_name method which uses the unidecode gem to convert accented and non-Latin characters to their ASCII equivalent (for instance, "Afeganistão" would become "Afeganistao"), and then sort on that:
require 'unidecode'
class Country
  def ascii_name
    Unidecoder.decode(name).gsub("[?]", "").gsub(/`/, "'").strip
  end
end
Country.all.sort_by(&:ascii_name)
However, there are obvious issues with this:
Does anyone know of a better way that I could sort my strings?
Ruby performs string comparisons based on the byte values of characters:
%w[à a e].sort
# => ["a", "e", "à"]
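To see why "à" lands after "e", it can help to look at the bytes that are actually being compared. This is only a minimal sketch, assuming UTF-8 strings; the helper name utf8_bytes is made up for this illustration:

```ruby
# Hypothetical helper (not part of any library): maps each word to its
# UTF-8 byte values, which is what String#<=> compares.
def utf8_bytes(words)
  words.each_with_object({}) { |word, table| table[word] = word.bytes }
end

utf8_bytes(%w[à a e]).each do |word, bytes|
  puts "#{word}: #{bytes.inspect}"
end
# "a" is [97] and "e" is [101], while "à" is the two bytes [195, 160],
# so "à" compares greater than every ASCII letter.
```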
To properly collate strings according to locale, the ffi-icu gem could be used:
require "ffi-icu"
ICU::Collation.collate("it_IT", %w[à a e])
# => ["a", "à", "e"]
ICU::Collation.collate("de", %w[a s x ß])
# => ["a", "s", "ß", "x"]
As an alternative:
collator = ICU::Collation::Collator.new("it_IT")
%w[à a e].sort { |a, b| collator.compare(a, b) }
# => %w[a à e]
Update: to test how strings should collate according to locale rules, the ICU project provides a nice collation demo tool.
http://github.com/grosser/sort_alphabetical
This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.
The only solution I have found thus far is to use ActiveSupport::Inflector.transliterate(string) to replace the Unicode characters with ASCII ones and sort:
Country.all.sort_by do |country|
  ActiveSupport::Inflector.transliterate country.name
end
Now the only problem is that this treats "ä" as equal to "a" (DIN 5007-1), so I end up with "Ägypten" before "Albanien", while I would expect it to be the other way around. Thankfully, the characters that the transliteration substitutes are configurable.
See documentation: http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate
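If ActiveSupport or ICU is not available, a rough approximation of the transliterate-and-sort idea is possible with Ruby's standard library alone. The sketch below is not ActiveSupport's transliterate; it swaps in String#unicode_normalize (Ruby 2.2+) to strip combining marks, and it adds the original string as a tie-breaker so that "a" and "ä" do not compare as equal. The method name sort_ignoring_accents is hypothetical:

```ruby
# Hypothetical helper: sorts by an accent-stripped, case-folded key,
# falling back to the original string to break ties.
def sort_ignoring_accents(strings)
  strings.sort_by do |s|
    # NFD splits "Ä" into "A" + U+0308; \p{Mn} matches the combining mark.
    base = s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
    [base, s]
  end
end

sort_ignoring_accents(%w[Belgien Albanien Ägypten])
# => ["Ägypten", "Albanien", "Belgien"]
```

This is only an approximation: characters without a canonical decomposition (such as "ß" or "ø") keep their byte order, and no locale-specific rules are applied.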
There are a couple of ways to go. You may want to convert the UTF-8 strings to hex strings of their code points and then sort them:
s.unpack('U*').map { |cp| format('%06x', cp) }.join
or you may use the library iconv. Read up on it and use it as appropriate (from dzone):
# Add this to environment.rb.
# Call to_iso on any UTF-8 string to get an ISO-8859-1 string back.
# Example: "Cédez le passage aux français".to_iso
require 'iconv' # this line is not needed in Rails!
class String
  def to_iso
    Iconv.conv('ISO-8859-1', 'utf-8', self)
  end
end
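Note that Iconv was removed from Ruby's standard library in Ruby 2.0 (it lives on as a separate gem). On newer Rubies, the built-in String#encode covers the same conversion. A minimal sketch, assuming UTF-8 input: to_iso is written here as a plain method rather than a String monkey patch, and replacing unmappable characters with "?" is my own choice, not part of the snippet above:

```ruby
# Converts a UTF-8 string to ISO-8859-1 using the built-in String#encode.
# Characters with no Latin-1 equivalent are replaced by "?" (an assumption
# made for this sketch; drop the options to raise an error instead).
def to_iso(str)
  str.encode("ISO-8859-1", invalid: :replace, undef: :replace, replace: "?")
end

iso = to_iso("Cédez le passage aux français")
iso.encoding.name # => "ISO-8859-1"
```

Keep in mind that sorting Latin-1 bytes is still a byte-wise sort: "à" (0xE0) continues to sort after "z".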