 

Equivalent of Iconv.conv("UTF-8//IGNORE",...) in Ruby 1.9.X?

I'm reading data from a remote source and occasionally get some characters in another encoding. They're not important.

I'd like to get a "best guess" UTF-8 string and ignore the invalid data.

The main goal is to get a string I can use without running into errors such as:

  • Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
  • invalid byte sequence in UTF-8
Jordan Feldstein asked Oct 24 '11 01:10





2 Answers

I thought this was it:

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

This replaces every invalid or unknown character with '?'.

To drop them instead, pass :replace => '':

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
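As a rough sketch of what the two calls do, assuming the data arrives tagged as ASCII-8BIT (binary), which is what the first error message in the question suggests. The sample bytes and variable names are made up for illustration:

```ruby
# Bytes as they might arrive from a socket or file, tagged as binary (ASCII-8BIT).
# "\xC3\xA9" is a valid UTF-8 "é"; "\xFF" is never valid in UTF-8.
raw = "caf\xC3\xA9 \xFF ok".b

# Replace every byte that has no mapping to UTF-8 with "?".
marked = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

# Drop those bytes instead of marking them.
stripped = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

p marked    # => "caf?? ? ok"
p stripped  # => "caf  ok"
```

Note the side effect: when the source is tagged binary, every byte above 0x7F counts as undefined, so the valid two-byte "é" is thrown away along with the stray \xFF.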

Edit:

I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:

string.encode("UTF-8", ...).force_encoding('UTF-8')

The script seems to be running OK now, but I'm pretty sure I got errors with this earlier.

Edit 2:

Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.
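One plausible cause, offered as a guess rather than a diagnosis: force_encoding only changes the encoding tag and never touches or validates the bytes, so a string can be labelled UTF-8 and still be invalid. My understanding is also that on the 1.9 series encode does nothing when the string is already tagged UTF-8, which would make the outcome depend on how each chunk happened to be tagged. A small check, with made-up sample bytes:

```ruby
raw = "abc\xFFdef".b                         # binary bytes containing a stray 0xFF
relabeled = raw.dup.force_encoding("UTF-8")  # same bytes, new label

p relabeled.encoding            # the tag is now UTF-8 ...
p relabeled.bytes == raw.bytes  # ... but the bytes are unchanged
p relabeled.valid_encoding?     # => false, so later string operations may still raise
```

Calling valid_encoding? on the failing strings is a cheap way to see whether this is what is happening.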

Jordan Feldstein answered Oct 04 '22 05:10



String#chars or String#each_char can also be used: keep only the characters for which valid_encoding? is true.

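# Sample bytes from Table 3-8, "Use of U+FFFD in UTF-8 Conversion":
# http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
# The trailing + continues the expression; a leading + on the next line
# would start a new statement and leave the second half out of str.
str = "\x61"+"\xF1\x80\x80"+"\xE1\x80"+"\xC2"+
      "\x62"+"\x80"+"\x63"+"\x80"+"\xBF"+"\x64"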

p [
  'abcd' == str.chars.collect { |c| (c.valid_encoding?) ? c : '' }.join,
  'abcd' == str.each_char.map { |c| (c.valid_encoding?) ? c : '' }.join
]

Since Ruby 2.1, String#scrub does the same thing directly.

p [
  'abcd' == str.scrub(''),
  'abcd' == str.scrub{ |c| '' }
]
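Putting the two approaches together, here is a minimal sketch of a helper that should work on both older and newer Rubies. The name clean_utf8 is hypothetical, and it assumes the incoming bytes are meant to be UTF-8 with occasional junk mixed in:

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +input+ with bad bytes dropped.
def clean_utf8(input)
  # Relabel a copy as UTF-8; the bytes themselves are not changed or checked here.
  text = input.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  if text.respond_to?(:scrub)
    text.scrub("")                                 # Ruby 2.1 and later
  else
    text.each_char.select(&:valid_encoding?).join  # fallback for older Rubies
  end
end

p clean_utf8("caf\xC3\xA9 \xFF ok".b)  # => "café  ok"
```

Because the bytes are relabelled as UTF-8 rather than transcoded from binary, valid multibyte characters such as "é" survive and only the stray \xFF is removed.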
masakielastic answered Oct 04 '22 05:10
