How can I globally ignore invalid byte sequences in UTF-8 strings?

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it to keep backwards compatibility.

I can't know the input encoding.

Example:

> "- Men\xFC -".split("n")
ArgumentError: invalid byte sequence in UTF-8
    from (irb):4:in `split'
    from (irb):4
    from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>'

I can overcome this problem in one line, for example:

> "- Men\xFC -".unpack("C*").pack("U*").split("n")
 => ["- Me", "ü -"] 

However, I would like to always ignore invalid byte sequences and disable these errors globally, either in Ruby itself or in Rails.

asked Jun 07 '13 by fotanus




1 Answer

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode, replacing any invalid or undefined characters with the Unicode replacement character (U+FFFD):

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding?  # => true

Unfortunately, the steps above would end up mangling a lot of valid UTF-8 codepoints, because the bytes in them would not be recognized during the conversion. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes, and each one would get converted to the replacement character.
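
For example (just a quick illustration with an arbitrary sample string; "ü" is a two-byte character in UTF-8):

s = "Menü".force_encoding('BINARY')                     # the underlying bytes are "Men\xC3\xBC"
s.encode("UTF-8", invalid: :replace, undef: :replace)   # => "Men\uFFFD\uFFFD"

To keep strings that are already valid UTF-8 intact, maybe you could do something like this: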

def to_utf8(str)
  # If the bytes already form valid UTF-8, just relabel the string and return it.
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Otherwise fall back to a byte-wise conversion, replacing the bad bytes.
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end
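
As a usage sketch (the variable names here are made up; the point is to call the helper at the boundary, e.g. right after reading a value from the database or from a request):

raw_name = "- Men\xFC -".force_encoding("BINARY")   # simulated dirty input
clean_name = to_utf8(raw_name)                      # => "- Men\uFFFD -"
clean_name.split("n")                               # => ["- Me", "\uFFFD -"], no ArgumentError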

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.

answered by David Grayson