I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it to keep backwards compatibility.
I can't know the input encoding.
Example:
> "- Men\xFC -".split("n")
ArgumentError: invalid byte sequence in UTF-8
from (irb):4:in `split'
from (irb):4
from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>'
I can overcome this problem in one line, by using the following, for example:
> "- Men\xFC -".unpack("C*").pack("U*").split("n")
=> ["- Me", "ü -"]
However, I would like to always ignore invalid byte sequences and disable these errors globally, either in Ruby itself or in Rails.
(Background: UTF-8 is based on 8-bit code units, and each character is encoded as one to four bytes. A one-byte character starts with a 0 bit, followed by its code point. For an n-byte character, the first byte starts with n 1-bits followed by a 0, and each of the remaining n-1 continuation bytes starts with the bits 10.)
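Under those rules the byte \xFC (binary 11111100) would announce a six-byte sequence, which UTF-8 forbids, so it can never start a valid character. You can see this at the byte level in Ruby:

s = "Men\xFC"
s.valid_encoding?  # => false
s.bytes.map { |b| b.to_s(2).rjust(8, "0") }
# => ["01001101", "01100101", "01101110", "11111100"]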
I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).
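For HTTP requests, that could look something like the hypothetical sketch below (the filter name is illustrative, and it relies on the to_utf8 helper defined at the end of this answer):

class ApplicationController < ActionController::Base
  # Scrub every top-level string parameter as it enters the app.
  before_filter :scrub_param_encodings

  private

  def scrub_param_encodings
    params.each do |key, value|
      params[key] = to_utf8(value) if value.is_a?(String)
    end
  end
end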
Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT) encoding. We can simulate that like this:
s = "Men\xFC".force_encoding('BINARY') # => "Men\xFC"
Then we can convert them to UTF-8 using String#encode, replacing any undefined characters with the UTF-8 replacement character:
s = s.encode("UTF-8", invalid: :replace, undef: :replace) # => "Men\uFFFD"
s.valid_encoding? # => true
Unfortunately, the steps above would mangle a lot of valid UTF-8 codepoints, because their bytes would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD" (encoded as the bytes EF BF BD), it would be interpreted as three separate undefined bytes, and each one would get converted to a replacement character.
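You can see the mangling directly, using the euro sign as an example of a valid three-byte character:

s = "\u20AC".force_encoding("BINARY")  # euro sign; bytes E2 82 AC
s.encode("UTF-8", invalid: :replace, undef: :replace)
# => "\uFFFD\uFFFD\uFFFD"

To work around that, maybe you could do something like this: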
def to_utf8(str)
  # If the bytes already form valid UTF-8, keep them untouched.
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?

  # Otherwise, fall back to byte-wise replacement of everything
  # outside the ASCII range. Note that force_encoding mutates str.
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end
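A quick check of that helper (values are illustrative):

to_utf8("Menü")                              # => "Menü" (valid UTF-8 passes through)
to_utf8("Men\xFC".force_encoding("BINARY"))  # => "Men\uFFFD"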
That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
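One note if you can upgrade: Ruby 2.1 added String#scrub, which does exactly that, replacing (or dropping) only the invalid bytes while leaving valid UTF-8 intact:

# Ruby 2.1+ only; not available on the 2.0.0-rc2 shown in the question.
"Men\xFC".scrub      # => "Men\uFFFD"
"Men\xFC".scrub("")  # => "Men" (drop invalid bytes instead)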