How can I globally ignore invalid byte sequences in UTF-8 strings?

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it to keep backwards compatibility.

I can't know the input encoding.

Example:

> "- Men\xFC -".split("n")
ArgumentError: invalid byte sequence in UTF-8
    from (irb):4:in `split'
    from (irb):4
    from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>'

I can overcome this problem in one line, for example:

> "- Men\xFC -".unpack("C*").pack("U*").split("n")
 => ["- Me", "ü -"] 

However, I would like to always ignore invalid byte sequences and disable these errors globally, either in Ruby itself or in Rails.

asked Jun 07 '13 by fotanus




1 Answer

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode, replacing any invalid or undefined characters with the Unicode replacement character (U+FFFD):

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding?  # => true

Unfortunately, the steps above would end up mangling a lot of valid UTF-8 codepoints, because the bytes in them would not be recognized during the conversion. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes, and each one would get converted to the replacement character.
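
For example (just a quick illustration with an arbitrary sample string; "ü" is a two-byte character in UTF-8):

s = "Menü".force_encoding('BINARY')                     # the underlying bytes are "Men\xC3\xBC"
s.encode("UTF-8", invalid: :replace, undef: :replace)   # => "Men\uFFFD\uFFFD"

To keep strings that are already valid UTF-8 intact, maybe you could do something like this: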

def to_utf8(str)
  # If the bytes already form valid UTF-8, just relabel the string and return it.
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Otherwise fall back to a byte-wise conversion, replacing the bad bytes.
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end
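
As a usage sketch (the variable names here are made up; the point is to call the helper at the boundary, e.g. right after reading a value from the database or from a request):

raw_name = "- Men\xFC -".force_encoding("BINARY")   # simulated dirty input
clean_name = to_utf8(raw_name)                      # => "- Men\uFFFD -"
clean_name.split("n")                               # => ["- Me", "\uFFFD -"], no ArgumentError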

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.

answered by David Grayson