Suppose you have a string like "€foo\xA0", encoded as UTF-8. Is there a way to remove invalid byte sequences from this string (so you get "€foo")?
In Ruby 1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0"), but that is now deprecated. "€foo\xA0".encode('UTF-8') doesn't do anything, since the string is already tagged as UTF-8. I tried:
"€foo\xA0".force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
which yields "foo", but that also loses the valid multibyte character €.
UTF-8 is a Unicode character encoding. It maps the code point of each Unicode character to a sequence of one to four bytes, and decoding reverses the mapping, turning those bytes back into characters.
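As a small illustration of that round trip, the sketch below takes the € character from the question, looks at its code point and its UTF-8 bytes, and rebuilds the character from the raw bytes. The variable names are only for illustration.

```ruby
euro = "€"

# The Unicode code point of the character (U+20AC).
code_point = euro.ord      # 8364

# The bytes UTF-8 uses to encode that code point.
bytes = euro.bytes         # [226, 130, 172], i.e. E2 82 AC

# Decoding: pack the raw bytes and tell Ruby to read them as UTF-8.
decoded = bytes.pack('C*').force_encoding('UTF-8')

puts format('U+%04X', code_point)
puts bytes.map { |b| format('%02X', b) }.join(' ')
puts decoded == euro
```

The lone \xA0 byte in the question is a continuation byte with no lead byte before it, which is why it does not decode to any character.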
The most important change is that Ruby 1.8 treats a string as a plain sequence of bytes, whereas Ruby 1.9 treats it as a sequence of code points. The bytes are still what is stored, in memory and on disk, but each string now carries an encoding that tells Ruby how to interpret them, and that pairing is what lets Ruby handle encodings.
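The difference between the byte view and the character view can be seen directly on the strings from the question. This is a minimal sketch for Ruby 1.9 or later; the variable names are illustrative.

```ruby
good = "€foo"
bad  = "€foo\xA0"

# Byte view versus character view of the same string.
good_bytes = good.bytesize       # 6 (three bytes for €, one each for f, o, o)
good_chars = good.length         # 4

# Every string carries an encoding tag.
tag = good.encoding              # UTF-8

# Ruby can check whether the bytes are well formed for that encoding.
good_ok = good.valid_encoding?   # true
bad_ok  = bad.valid_encoding?    # false, because of the stray \xA0 byte

puts good_bytes, good_chars, tag, good_ok, bad_ok
```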
Since Ruby 2.0, source files default to UTF-8. When Ruby reads a file, it tags the data with Encoding.default_external (normally taken from the operating system locale) and transcodes it only if an internal encoding is set. If the defaults are not what you want, you can change them with Encoding.default_external= and Encoding.default_internal=, or pass an explicit encoding when opening the file.
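The following sketch shows the per-file alternative to changing the global defaults. It assumes a temporary file containing the ISO-8859-1 bytes for "café" (a made-up example, not from the question) and reads it twice: once tagged as ISO-8859-1, and once transcoded to UTF-8 on the way in.

```ruby
require 'tempfile'

raw = nil
transcoded = nil

Tempfile.create('latin1') do |f|
  f.binmode
  f.write("caf\xE9".b)   # "café" in ISO-8859-1: é is the single byte E9
  f.flush

  # External encoding only: the bytes are kept as-is and tagged ISO-8859-1.
  raw = File.read(f.path, encoding: 'ISO-8859-1')

  # External:internal pair: Ruby transcodes from ISO-8859-1 to UTF-8.
  transcoded = File.read(f.path, encoding: 'ISO-8859-1:UTF-8')
end

puts raw.encoding          # ISO-8859-1
puts transcoded.encoding   # UTF-8
puts transcoded.bytes.inspect
```

Passing the encoding per call keeps the effect local, whereas assigning to Encoding.default_internal changes the behaviour of every later read in the process.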
"€foo\xA0".encode('UTF-16le', invalid: :replace, replace: '').encode('UTF-8')
"€foo\xA0".chars.select(&:valid_encoding?).join
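For comparison, the sketch below runs the two one-liners given earlier (the UTF-16 round trip and the per-character filter) side by side on the string from the question. It also adds String#scrub, which is not mentioned in the original text; it is part of core Ruby from 2.1 onward and is probably the simplest option where it is available. The variable names are illustrative.

```ruby
input = "€foo\xA0"

# Round trip through UTF-16, dropping invalid sequences during conversion.
via_utf16 = input.encode('UTF-16LE', invalid: :replace, replace: '').encode('UTF-8')

# Keep only the characters whose bytes are valid in the string's encoding.
via_chars = input.chars.select(&:valid_encoding?).join

# Ruby 2.1+: replace each invalid byte sequence with the given string.
via_scrub = input.scrub('')

puts via_utf16   # €foo
puts via_chars   # €foo
puts via_scrub   # €foo
```

All three keep the valid multibyte € and drop only the stray \xA0 byte, unlike the force_encoding('BINARY') attempt, which treats every byte above 0x7F as undefined.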