How can I replace UTF-8 errors in Ruby without converting to a different encoding?

Tags: string, ruby, encoding, unicode, utf-8

To convert a string to UTF-8 and replace all encoding errors, you can do:

str.encode('utf-8', :invalid=>:replace)

The only problem is that this doesn't work if str is already UTF-8; in that case the invalid bytes remain:

irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false

To quote the Ruby Docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:

str.encode('utf-16', :invalid=>:replace).encode('utf-8')

For example:

irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true

Is there a better way to do this without converting to a dummy encoding?

463

asked Oct 03 '13 16:10

Matt

2 Answers

Try this to drop the invalid characters:

"foo\x92bar".chars.select(&:valid_encoding?).join
# => "foobar"

Or, to replace them with a placeholder:

"foo\x92bar".chars.map { |c| c.valid_encoding? ? c : "?" }.join
# => "foo?bar"
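If you need this in more than one place, the two one-liners above could be folded into a single helper. This is only a sketch: the method name replace_invalid_chars and the default "?" placeholder are my own choices, not anything built into Ruby.

```ruby
# Hypothetical helper (name and default placeholder are assumptions).
# Walks the string character by character and swaps every character
# that is not valid in the string's encoding for the replacement.
def replace_invalid_chars(str, replacement = "?")
  str.chars.map { |c| c.valid_encoding? ? c : replacement }.join
end

puts replace_invalid_chars("foo\x92bar")      # prints foo?bar
puts replace_invalid_chars("foo\x92bar", "")  # prints foobar
```

Passing an empty string as the replacement gives the same result as the select version.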

116

answered Oct 04 '22 09:10

tihom

Ruby 2.1 has added a String#scrub method that does what you want:

2.1.0dev :001 > x = "foo\x92bar"
 => "foo\x92bar" 
2.1.0dev :002 > x.valid_encoding?
 => false 
2.1.0dev :003 > y = x.scrub
 => "foo�bar" 
2.1.0dev :004 > y.valid_encoding?
 => true
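scrub also accepts a replacement string or a block, which may be handy if you don't want U+FFFD in the output. A small sketch, assuming Ruby 2.1 or later; the variable names are just for illustration:

```ruby
raw = "foo\x92bar"

# Replace each invalid byte sequence with a string of your choice.
custom_fix = raw.scrub("?")

# Or compute the replacement from the offending bytes with a block.
hex_fix = raw.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts custom_fix  # prints foo?bar
puts hex_fix     # prints foo<92>bar
```

There is also scrub! if you would rather modify the string in place.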

The same commit also changes the behaviour of encode so that :invalid=>:replace works when the source and destination encodings are the same:

2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
 => "foo�bar" 
2.1.0dev :006 > x.valid_encoding?
 => true
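On 2.1 and later you should also be able to combine this with the :replace option of encode to pick your own substitute instead of U+FFFD. A minimal sketch, assuming Ruby 2.1+ (on older versions the same-encoding call is still a no-op):

```ruby
raw = "foo\x92bar"

# Same source and destination encoding; invalid bytes are replaced
# with the string given via the :replace option.
fixed = raw.encode('UTF-8', invalid: :replace, replace: '?')

p fixed                  # => "foo?bar"
p fixed.valid_encoding?  # => true
```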

As far as I know, there is no built-in way to do this before 2.1 (otherwise scrub wouldn't be needed), so you'll need to use a workaround until 2.1 is released and you can upgrade.
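If the same code has to run on both old and new Rubies, one option is to use scrub when it exists and fall back to a workaround otherwise. This is only a sketch: scrub_compat is a made-up name, and the fallback branch reuses the character-by-character approach from the other answer.

```ruby
# Hypothetical compatibility helper (the name is an assumption).
# Uses String#scrub where available (Ruby 2.1+), otherwise falls back
# to checking each character individually.
def scrub_compat(str, replacement = "\uFFFD")
  if str.respond_to?(:scrub)
    str.scrub(replacement)
  else
    str.chars.map { |c| c.valid_encoding? ? c : replacement }.join
  end
end

cleaned = scrub_compat("foo\x92bar")
puts cleaned.valid_encoding?  # prints true
```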

answered Oct 04 '22 08:10

matt
