
Ruby character transliteration

What's the current best way to transliterate characters to 7-bit ASCII in Ruby? Most of the questions I've seen on SO are 3 or 4 years old, and the solutions don't fully work.

I want a method that will work for a wide range of Latin alphabets and, for example, convert

Your résumé’s a non–encyclopædia

to

Your resume's a non-encyclopaedia

but I cannot find a way that does that, particularly for folding 8-bit ASCII to 7-bit ASCII.

s =  "Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia"
puts Iconv.iconv('ascii//ignore//translit', 'utf-8', s)
# => Your r'esum'e's a non-encyclopaedia
puts s.encode('ascii//ignore//translit', 'utf-8')
# => Encoding::ConverterNotFoundError: code converter not found (UTF-8 to ascii//ignore//translit)
puts s.encode('ascii', 'utf-8')
# Encoding::UndefinedConversionError: U+00E9 from UTF-8 to US-ASCII
puts s.encode('ascii', 'utf-8', invalid: :replace, undef: :replace)
# Your r?sum??s a non?encyclop?dia
puts I18n.transliterate(s)
# Your resume?s a non?encyclopaedia

Since Iconv is deprecated I'd rather not use that if I don't have to, but I'd do it if that is the only thing that works. Obviously I could put in custom 8-bit ASCII to 7-bit ASCII translations, but I'd prefer to use a supported solution that has been thoroughly tested.

The translation is handled fine by International Components for Unicode with its Latin-ASCII translation, but that is only available for Java and C.

UPDATE

What I ended up doing was writing my own character translation routines to take care of punctuation and whitespace, after which I could use I18n.transliterate to do the rest. I'd still prefer finding and using a well-maintained library function to handle the stuff I18n does not.
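A minimal sketch of what such a punctuation and whitespace pass might look like, using only core Ruby. The names `PUNCTUATION`, `PUNCTUATION_PATTERN`, and `fold_punctuation` are hypothetical, and the mapping table is an illustrative assumption rather than the routines actually used; the result could then be handed to I18n.transliterate for the letters.

```ruby
# Hypothetical mapping table (an assumption for illustration);
# extend it for whatever punctuation appears in your text.
PUNCTUATION = {
  "\u2018" => "'",   # left single quotation mark
  "\u2019" => "'",   # right single quotation mark / apostrophe
  "\u201c" => '"',   # left double quotation mark
  "\u201d" => '"',   # right double quotation mark
  "\u2013" => "-",   # en dash
  "\u2014" => "-",   # em dash
  "\u2026" => "...", # horizontal ellipsis
  "\u00a0" => " "    # no-break space
}.freeze

# One regexp that matches any key of the table.
PUNCTUATION_PATTERN = Regexp.union(PUNCTUATION.keys)

# Replace each mapped character with its ASCII stand-in.
# Letters such as \u00e9 and \u00e6 are deliberately left untouched.
def fold_punctuation(str)
  str.gsub(PUNCTUATION_PATTERN, PUNCTUATION)
end

s = "Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia"
puts fold_punctuation(s)
```

Keeping the table as plain data makes it easy to review and extend, which matters when the mapping is hand-maintained rather than coming from a library.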

Old Pro asked Jun 16 '26 07:06


1 Answer

If you're willing to add a somewhat heavy dependency (unless you're already on Rails), ActiveSupport has support (pun not intended) for this:

require 'active_support/all'

# NFKD-decompose the string, then keep only the ASCII characters
ActiveSupport::Multibyte::Chars.new("Your r\u00e9sum\u00e9\u2019s not an encyclop\u00e6dia").unicode_normalize(:nfkd).chars.select(&:ascii_only?).join

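This works for accented letters such as é, which decompose into a base letter plus a combining mark. It doesn't handle the apostrophe correctly yet, though, and letters with no Unicode decomposition (such as æ) are dropped rather than converted.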
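For comparison, the same decompose-and-filter idea can be sketched with only Ruby's standard library, using String#unicode_normalize (Ruby 2.2 and later). The names `LIGATURES` and `fold_letters` are hypothetical, and the small table is an illustrative assumption: it is there because letters such as æ, œ, ß, and ø have no Unicode decomposition, so NFKD alone leaves them unchanged. The curly apostrophe still passes through untouched and needs separate handling.

```ruby
# Hypothetical table (an assumption for illustration) for letters
# that NFKD cannot decompose into a base letter plus a mark.
LIGATURES = {
  "\u00e6" => "ae", "\u00c6" => "AE",
  "\u0153" => "oe", "\u0152" => "OE",
  "\u00df" => "ss",
  "\u00f8" => "o",  "\u00d8" => "O"
}.freeze

LIGATURE_PATTERN = Regexp.union(LIGATURES.keys)

# 1. Replace the letters listed in the table.
# 2. Decompose accented letters (NFKD): "\u00e9" becomes "e" + U+0301.
# 3. Strip the combining marks (Unicode category Mn).
def fold_letters(str)
  str.gsub(LIGATURE_PATTERN, LIGATURES)
     .unicode_normalize(:nfkd)
     .gsub(/\p{Mn}/, "")
end

puts fold_letters("Your r\u00e9sum\u00e9 is not an encyclop\u00e6dia")
```

Stripping only combining marks, instead of deleting every non-ASCII character, means anything the sketch cannot fold stays visible in the output rather than silently disappearing.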

Linuxios answered Jun 18 '26 01:06