To convert a string to UTF-8 and replace any invalid byte sequences, you can do:
str.encode('utf-8', :invalid=>:replace)
The only problem is that this doesn't work if str is already UTF-8, in which case any errors remain:
irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false
To quote the Ruby docs:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:
str.encode('utf-16', :invalid=>:replace).encode('utf-8')
For example:
irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true
Is there a better way to do this without converting to a dummy encoding?
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.
Ruby has the method Encoding.default_external, which reports the encoding Ruby assumes for data read from outside the program; it is normally derived from the operating system's locale. Strings read from a file are tagged with that encoding. They are transcoded only when an internal encoding is set (via Encoding.default_internal or in the file's open mode), so bytes that are not valid in the assumed encoding stay in the string as-is.
What are often called "non-UTF-8 characters" are byte sequences that are not valid UTF-8. They usually come from text written in a different encoding, such as ISO 8859-1 or Windows-1252, that has been labelled as UTF-8.
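To illustrate where a stray byte like \x92 can come from, here is a minimal sketch. It assumes the bytes were originally written as Windows-1252, where 0x92 is the right single quotation mark; nothing in the code detects that, and the helper name reinterpret_as_utf8 is made up for this example. When the real source encoding is known, reinterpreting the bytes keeps the character instead of replacing it.

```ruby
# Hypothetical helper: treat the bytes of `str` as `source_encoding`
# and transcode them to UTF-8. The caller must know the real source encoding.
def reinterpret_as_utf8(str, source_encoding)
  # dup so the caller's string keeps its original encoding tag
  str.dup.force_encoding(source_encoding).encode('UTF-8')
end

raw = "foo\x92bar"          # tagged UTF-8, but 0x92 alone is not valid UTF-8
raw.valid_encoding?         # => false

# Assumption: the data was really Windows-1252.
fixed = reinterpret_as_utf8(raw, 'Windows-1252')
fixed.valid_encoding?       # => true
fixed == "foo\u2019bar"     # => true (U+2019 RIGHT SINGLE QUOTATION MARK)
```

If the source encoding is unknown, this is only a guess, and replacing or dropping the invalid bytes is the safer option.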
Try this:
"foo\x92bar".chars.select(&:valid_encoding?).join
# => "foobar"
Or, to replace each invalid byte with a placeholder:
"foo\x92bar".chars.map{|c| c.valid_encoding? ? c : "?"}.join
# => "foo?bar"
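The two one-liners above can be wrapped in a small helper. This is a sketch only; the method name replace_invalid_chars and its default replacement are choices made for this example. It relies on String#chars yielding each invalid byte as its own one-byte "character" while valid multibyte characters are kept whole.

```ruby
# Hypothetical helper: replace every invalid character with `replacement`.
# Pass "" as the replacement to drop invalid bytes instead.
def replace_invalid_chars(str, replacement = "?")
  str.chars.map { |c| c.valid_encoding? ? c : replacement }.join
end

# A valid multibyte character (U+00E9) followed by a stray invalid byte.
mixed = "caf\u00e9" + "\x92" + "bar"

cleaned  = replace_invalid_chars(mixed)      # => "café?bar"
stripped = replace_invalid_chars(mixed, "")  # => "cafébar"
```

This walks the string character by character in Ruby, so it is likely to be slower than a built-in conversion on large inputs.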
Ruby 2.1 has added a String#scrub method that does what you want:
2.1.0dev :001 > x = "foo\x92bar"
=> "foo\x92bar"
2.1.0dev :002 > x.valid_encoding?
=> false
2.1.0dev :003 > y = x.scrub
=> "foo�bar"
2.1.0dev :004 > y.valid_encoding?
=> true
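String#scrub also accepts a replacement string or a block that receives the invalid bytes. A short sketch, assuming Ruby 2.1 or later; the variable names are just for illustration:

```ruby
x = "foo\x92bar"

# Use a custom replacement instead of the default U+FFFD.
with_mark = x.scrub("?")   # => "foo?bar"

# An empty replacement drops the invalid bytes entirely.
dropped = x.scrub("")      # => "foobar"

# A block receives each run of invalid bytes and returns its replacement;
# here the bytes are shown as hex for debugging.
tagged = x.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "foo<92>bar"
```

scrub returns a new string; scrub! modifies the receiver in place.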
The same commit also changes the behaviour of encode so that it works when the source and dest encodings are the same:
2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo�bar"
2.1.0dev :006 > x.valid_encoding?
=> true
As far as I know there is no built-in way to do this before 2.1 (otherwise scrub wouldn't be needed), so you'll need to use a workaround until 2.1 is released and you can upgrade.
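One way to bridge both worlds is a helper that uses scrub when it is available and falls back to the UTF-16 round trip from the question otherwise. This is a sketch under the assumption that the input is tagged as UTF-8; the method names scrub_via_utf16 and scrub_utf8 are made up for this example.

```ruby
# Fallback for Rubies without String#scrub: the round trip through UTF-16
# from the question, which forces a real conversion so :invalid => :replace applies.
def scrub_via_utf16(str)
  str.encode('UTF-16', :invalid => :replace).encode('UTF-8')
end

# Hypothetical wrapper: prefer the built-in method when the running Ruby has it.
def scrub_utf8(str)
  str.respond_to?(:scrub) ? str.scrub : scrub_via_utf16(str)
end

clean = scrub_utf8("foo\x92bar")
clean.valid_encoding?   # => true
```

Once every deployment runs Ruby 2.1 or later, the wrapper can be replaced by a plain call to scrub.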