Ruby String.encode still gives "invalid byte sequence in UTF-8"

Tags: ruby, encoding

In IRB, I'm trying the following:

1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
 => "\xBF" 
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'

Any ideas what's going wrong?

379

asked May 05 '12 21:05

drewinglis

1 Answer

I'd guess that "\xBF" is already tagged as UTF-8, so when you call encode, Ruby sees a UTF-8 to UTF-8 conversion and does nothing:

>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>
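
A quick way to confirm that the label and the bytes disagree is valid_encoding?. This is only a sketch: the helper name utf8_label_but_invalid? is made up for illustration, and it assumes the literal picks up UTF-8 as its encoding, as in the session above.

```ruby
# Hypothetical helper (not part of Ruby): true when a string is tagged
# UTF-8 but its bytes are not a valid UTF-8 sequence.
def utf8_label_but_invalid?(str)
  str.encoding == Encoding::UTF_8 && !str.valid_encoding?
end

# Assumes the source encoding is UTF-8, so the literal is tagged UTF-8.
s = "\xBF"
puts s.encoding                   # the label Ruby attached to the literal
puts s.valid_encoding?            # whether the bytes actually fit that label
puts utf8_label_but_invalid?(s)
```

If the last line reports true, the string is in exactly the state that makes match raise.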

\xBF on its own isn't valid UTF-8 (it is a bare continuation byte), so this label is, of course, nonsense. But encode also has a three-argument form:

encode(dst_encoding, src_encoding [, options] ) → str

[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.

You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:

>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"

Here s is the "\xBF" that thinks it is UTF-8 from above. The binary byte 0xBF has no UTF-8 equivalent, so :undef => :replace substitutes U+FFFD (�).
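
To tie this back to the original problem, here is a minimal sketch that wraps the three-argument call and then runs the match that used to raise. The method name transcode_from_binary is just an illustrative name, not anything from Ruby itself.

```ruby
# Hypothetical wrapper around the three-argument form of encode:
# treat the bytes as binary no matter what str.encoding claims.
def transcode_from_binary(str)
  str.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
end

foo = transcode_from_binary("\xBF")
puts foo.valid_encoding?        # the result should now be well-formed UTF-8
puts foo.match(/foo/).inspect   # match runs instead of raising ArgumentError
```

Plain ASCII bytes pass through unchanged; only bytes with no UTF-8 mapping get replaced.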

You could also use force_encoding on s to force it to be binary and then use the two-argument encode:

>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
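
One thing to keep in mind is that force_encoding changes s in place. If you'd rather leave the original string alone, a sketch along these lines works on a copy instead (reencode_copy is a made-up name for illustration):

```ruby
# Hypothetical helper: relabel a copy as binary, then use the
# two-argument encode, so the caller's string keeps its encoding.
def reencode_copy(str)
  str.dup
     .force_encoding(Encoding::BINARY)
     .encode('utf-8', :invalid => :replace, :undef => :replace)
end

s = "\xBF"
foo = reencode_copy(s)
puts s.encoding           # s still carries whatever label it had before
puts foo.valid_encoding?  # the copy has been cleaned up
```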

192

answered Nov 01 '22 15:11

mu is too short
