
ruby 1.9, force_encoding, but check

I have a string I have read from some kind of input.

To the best of my knowledge, it is UTF8. Okay:

string.force_encoding("utf8")

But if this string has bytes in it that are not in fact legal UTF8, I want to know now and take action.

Ordinarily, will force_encoding("utf8") raise if it encounters such bytes? I believe it will not.

If I were doing an #encode, I could choose from the handy options for what to do with characters that are invalid in the source encoding (or the destination encoding).

But I'm not doing an #encode, I'm doing a #force_encoding. It has no such options.

Would it make sense to

string.force_encoding("utf8").encode("utf8")

to get an exception right away? Normally encoding from utf8 to utf8 doesn't make any sense. But maybe this is the way to get it to raise right away if there are invalid bytes? Or to use the :replace option etc. to do something different with invalid bytes?

But no, can't seem to make that work either.

Anyone know?

1.9.3-p0 :032 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
1.9.3-p0 :033 > a.valid_encoding?
=> false

Okay, but how do I find and eliminate those bad bytes? Oddly, this does NOT raise:

1.9.3-p0 :035 > a.encode("utf-8")
 => "bad: \xC3( okay"

If I was converting to a different encoding, it would!

1.9.3-p0 :039 > a.encode("ISO-8859-1")
Encoding::InvalidByteSequenceError: "\xC3" followed by "(" on UTF-8

Or if I told it to, it'd replace it with a "?":

1.9.3-p0 :040 > a.encode("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"

So Ruby's got the smarts to know which bytes are bad in utf-8, and to replace them with something else, when converting to a different encoding. But I don't want to convert to a different encoding, I want to stay utf8. I might want to raise if there's an invalid byte in there, or I might want to replace invalid bytes with replacement chars.

Isn't there some way to get ruby to do this?

Update: I believe this has finally been added in Ruby 2.1, where String#scrub (present in the 2.1 preview release) does this. So look for that!

asked Apr 17 '12 23:04 by jrochkind


3 Answers

In Ruby 2.1, core String finally supports this with String#scrub.

http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
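
A minimal sketch of how scrub can cover both cases from the question (raise on bad bytes, or replace them), assuming Ruby 2.1 or later. The helper name ensure_valid_utf8! is made up for illustration and is not part of Ruby:

```ruby
# Assumes Ruby 2.1+ (String#scrub).
# ensure_valid_utf8! is a hypothetical helper name, not a Ruby built-in.
def ensure_valid_utf8!(str)
  str.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "invalid UTF-8 byte sequence" unless str.valid_encoding?
  str
end

# .b returns a binary (ASCII-8BIT) copy, standing in for bytes read from some input
raw = "bad: \xC3\x28 okay".b

utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8.valid_encoding?                 # => false

# Replace each invalid sequence with U+FFFD (the default for UTF-8) ...
default_scrubbed = utf8.scrub
# ... or with a replacement string of your choice
custom_scrubbed = utf8.scrub("?")    # => "bad: ?( okay"

# Or fail fast instead of repairing
begin
  ensure_valid_utf8!(raw.dup)
  raised = false
rescue ArgumentError
  raised = true
end
```

scrub returns a new string and leaves the receiver untouched; scrub! modifies the string in place.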

answered Nov 08 '22 14:11 by jrochkind


Make sure that your script file itself is saved as UTF-8 and try the following:

# encoding: UTF-8
p [a = "bad: \xc3\x28 okay", a.valid_encoding?]
p [a.force_encoding("utf-8"), a.valid_encoding?]
p [a.encode!("ISO-8859-1", :invalid => :replace), a.valid_encoding?]

On my Windows 7 system, this gives the following:

["bad: \xC3( okay", false]
["bad: \xC3( okay", false]
["bad: ?( okay", true]

So your bad char is replaced. You can do it in one step as follows:

a = "bad: \xc3\x28 okay".encode!("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
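
The result above ends up encoded as ISO-8859-1. If the string should stay UTF-8, one possible workaround is to round-trip through another Unicode encoding so that a real conversion happens and :invalid => :replace takes effect. This is only a sketch: the helper name replace_invalid_utf8 and the choice of UTF-16BE as the intermediate encoding are assumptions made for illustration.

```ruby
# Hypothetical helper: forces a real transcoding step (UTF-8 -> UTF-16BE -> UTF-8)
# so that invalid UTF-8 bytes are replaced, while the final string is UTF-8 again.
def replace_invalid_utf8(str, replacement = "?")
  str.dup.force_encoding("UTF-8")
     .encode("UTF-16BE", :invalid => :replace, :replace => replacement)
     .encode("UTF-8")
end

fixed = replace_invalid_utf8("bad: \xC3\x28 okay")
p [fixed, fixed.encoding, fixed.valid_encoding?]
```

Because UTF-16BE can represent every Unicode character, valid characters should survive the round trip unchanged; only the invalid bytes are swapped for the replacement string.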

EDIT: Here is a solution that works with any encoding. The first method re-encodes only the bad chars; the second simply replaces each bad char with a ?

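# Re-encodes only the invalid chars, letting String#encode! replace them
def validate_encoding(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : c.encode!(Encoding.locale_charmap, :invalid => :replace)
  end.join
end

# Replaces every invalid char with a literal "?"
def validate_encoding2(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : '?'
  end.join
end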

a = "bad: \xc3\x28 okay"

puts validate_encoding(a)                  #=>bad: ?( okay
puts validate_encoding(a).valid_encoding?  #=>true


puts validate_encoding2(a)                  #=>bad: ?( okay
puts validate_encoding2(a).valid_encoding?  #=>true
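
The same per-character check can also be used to find the bad bytes rather than replace them. A small sketch, where invalid_chars is a hypothetical helper name introduced only for this example:

```ruby
# Hypothetical helper: returns [char_index, hex_bytes] pairs for every
# "character" that is not valid in the string's current encoding.
def invalid_chars(str)
  str.each_char.with_index
     .reject { |c, _| c.valid_encoding? }
     .map { |c, i| [i, c.unpack("H*").first] }
end

a = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")
p invalid_chars(a)   # => [[5, "c3"]]
```

The index is a character index, not a byte offset, so it lines up with what str.chars returns.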
answered Nov 08 '22 14:11 by peter


To check that a string has no invalid sequences, try to convert it to the binary encoding:

# Returns true if the string has only valid sequences
def valid_encoding?(string)
  string.encode('binary', :undef => :replace)
  true
rescue Encoding::InvalidByteSequenceError
  false
end

p valid_encoding?("\xc0".force_encoding('iso-8859-1'))    # true
p valid_encoding?("\u1111")                               # true
p valid_encoding?("\xc0".force_encoding('utf-8'))         # false

This code replaces undefined characters, because we don't care if there are valid sequences that cannot be represented in binary. We only care if there are invalid sequences.

A slight modification to this code returns the actual error, which has valuable information about the improper encoding:

# Returns the encoding error, or nil if there isn't one.

def encoding_error(string)
  string.encode('binary', :undef => :replace)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.to_s
end

# Returns truthy if the string has only valid sequences

def valid_encoding?(string)
  !encoding_error(string)
end

puts encoding_error("\xc0".force_encoding('iso-8859-1'))    # nil
puts encoding_error("\u1111")                               # nil
puts encoding_error("\xc0".force_encoding('utf-8'))         # "\xC0" on UTF-8
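
Beyond the message text, the exception object exposes the offending bytes directly. A sketch along the same lines as the code above, assuming the same conversion-to-binary trick; invalid_sequence_details is a hypothetical helper name, while the accessor methods are those of Encoding::InvalidByteSequenceError:

```ruby
# Hypothetical helper: returns nil for a clean string, otherwise a hash
# describing the first invalid sequence that the converter ran into.
def invalid_sequence_details(string)
  string.encode("binary", :undef => :replace)
  nil
rescue Encoding::InvalidByteSequenceError => e
  {
    :error_bytes     => e.error_bytes,           # the offending bytes (binary string)
    :readagain_bytes => e.readagain_bytes,       # bytes read while detecting the error
    :source_encoding => e.source_encoding_name,  # e.g. "UTF-8"
    :incomplete      => e.incomplete_input?      # true if the input ended mid-sequence
  }
end

details = invalid_sequence_details("bad: \xC3\x28 okay".dup.force_encoding("UTF-8"))
p details
```

Only the first invalid sequence is reported, since the conversion stops at the first error.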
answered Nov 08 '22 14:11 by Wayne Conrad