Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

Suppose you have a string like "€foo\xA0", encoded as UTF-8. Is there a way to remove the invalid byte sequences from this string, so that you get "€foo"?

In Ruby 1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0"), but Iconv is now deprecated. "€foo\xA0".encode('UTF-8') doesn't do anything, since the string is already tagged as UTF-8. I tried:

"€foo\xA0".force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '') 

which yields

"foo"

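But that also loses the valid multibyte character €, because once the string is treated as BINARY every byte above 0x7F is undefined in UTF-8 and gets replaced.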
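To make the failure concrete, here is a minimal sketch of the attempt above, split into steps. It assumes the source file is saved as UTF-8; the variable names are only for illustration.

```ruby
# encoding: utf-8

s = "€foo\xA0"
s.valid_encoding?   # false, the trailing \xA0 is not valid UTF-8

# Retagging as BINARY (ASCII-8BIT) discards the knowledge that
# E2 82 AC together form the single character €.
binary = s.dup.force_encoding('BINARY')
byte_count = binary.bytes.to_a.length   # 7 bytes: 3 for €, 3 for "foo", 1 for \xA0

# Converting BINARY to UTF-8 treats each high byte (>= 0x80) as an
# undefined character, so all four of them are replaced with ''.
stripped = binary.encode('UTF-8', :undef => :replace, :replace => '')
puts stripped   # prints "foo"
```

The three bytes of € are dropped for the same reason as \xA0, which is why this approach cannot tell valid multibyte characters from stray bytes.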

StefanH asked Jan 03 '12 09:01


2 Answers

"€foo\xA0".encode('UTF-16le', invalid: :replace, replace: '').encode('UTF-8') 
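The one-liner above can be unpacked as follows. This is a sketch under the assumption that the source file is UTF-8; the intermediate variable names are hypothetical. The idea is that transcoding to a different encoding makes Ruby actually decode the bytes, so the :invalid option gets a chance to act on the stray \xA0 while € is decoded normally.

```ruby
# encoding: utf-8

s = "€foo\xA0"

# Step 1: transcode to UTF-16LE, replacing invalid byte sequences with ''.
utf16 = s.encode('UTF-16LE', :invalid => :replace, :replace => '')

# Step 2: transcode back to UTF-8; everything left is valid by now.
clean = utf16.encode('UTF-8')

puts clean                   # prints "€foo"
puts clean.valid_encoding?   # prints "true"
```

Any intermediate encoding that can represent all of Unicode should serve the same purpose; UTF-16LE is simply the one used in the answer.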
Van der Hoorn answered Sep 19 '22 19:09


"€foo\xA0".chars.select(&:valid_encoding?).join 
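A step-by-step sketch of the same idea, again assuming a UTF-8 source file and with illustrative variable names. String#chars yields each invalid byte as its own one-byte string, so valid_encoding? can be checked per character. The last lines show String#scrub, which is not part of either answer and is only available on Ruby 2.1 and later, hence the respond_to? guard.

```ruby
# encoding: utf-8

s = "€foo\xA0"

# to_a keeps this working on Ruby 1.9, where chars returns an enumerator.
pieces = s.chars.to_a
flags  = pieces.map(&:valid_encoding?)   # [true, true, true, true, false]

# Keep only the characters that are valid in the string's encoding.
clean = pieces.select(&:valid_encoding?).join
puts clean   # prints "€foo"

# On Ruby 2.1+ the built-in scrub gives the same result for this input.
scrubbed = s.respond_to?(:scrub) ? s.scrub('') : clean
```

This version allocates one small string per character, so for very large inputs the transcoding approach may be the cheaper one.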
Evgenii answered Sep 19 '22 19:09