
Strange behavior in packed Ruby strings

I'm confused by some Ruby behavior. Look at the following code:

[127].pack("C") == "\x7f"   # => true

This makes sense. Now:

[128].pack("C")             # => "\x80"
"\x80"                      # => "\x80"
[128].pack("C") == "\x80"   # => false

The pack option "C" stands for 8-bit unsigned (unsigned char), which should be fine to store a value of 128. Also both strings print the same thing, so why are they not equal? Does this have something to do with encoding stuff?

I'm on ruby 2.0.0p247.

lucas clemente asked Nov 14 '13 12:11


2 Answers

It is false because the encodings differ:

[128].pack("C").encoding
#=> #<Encoding:ASCII-8BIT>
"\x80".encoding
#=> #<Encoding:UTF-8>

(using ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-linux])

In Ruby 2.0 the default encoding for string literals is UTF-8, but pack with the "C" directive returns an ASCII-8BIT (binary) string.
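
To see that the two strings hold the same byte and differ only in their encoding tag, a small sketch may help. The helper name string_info is made up for illustration, and the literal is tagged as UTF-8 explicitly so the result does not depend on the encoding of the file the snippet is pasted into.

```ruby
# Hypothetical helper: collects the properties that matter for the comparison.
def string_info(str)
  { bytes: str.bytes, encoding: str.encoding, ascii_only: str.ascii_only? }
end

packed  = [128].pack("C")
# Tag the literal explicitly as UTF-8, which is what a UTF-8 source file produces.
literal = "\x80".dup.force_encoding(Encoding::UTF_8)

p string_info(packed)   # same single byte (128), tagged ASCII-8BIT
p string_info(literal)  # same single byte (128), tagged UTF-8
p packed == literal     # false
```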

Why is [127].pack('C') == "\x7f" true then?

[127].pack('C') == "\x7f" is still true because ASCII and UTF-8 do not differ for the code points 0 to 127. Ruby's string comparison takes this into account (have a look at the Rubinius source code):

def ==(other)
  [...]

  return false unless @num_bytes == other.bytesize
  return false unless Encoding.compatible?(self, other)
  return @data.compare_bytes(other.__data__, @num_bytes, other.bytesize) == 0
end

The MRI C source is similar, but harder to read.

Note that the comparison checks for compatible encodings. Let's try that:

Encoding.compatible?([127].pack("C"), "\x7f") #=> #<Encoding:ASCII-8BIT>
Encoding.compatible?([128].pack("C"), "\x80") #=> nil

So for byte values of 128 and above, the encodings are no longer compatible and the comparison returns false, even though both strings consist of the same bytes.
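
If the goal is to compare raw bytes regardless of encoding, one option is to compare binary copies of both strings. This is a minimal sketch, assuming Ruby 2.0 or later for String#b; the helper name same_bytes? is hypothetical.

```ruby
# Hypothetical helper: true when both strings consist of the same bytes,
# ignoring the encoding each one is tagged with.
def same_bytes?(a, b)
  # String#b returns a copy tagged ASCII-8BIT; the bytes themselves are untouched.
  a.b == b.b
end

packed  = [128].pack("C")
# Tag the literal explicitly as UTF-8, which is what a UTF-8 source file produces.
literal = "\x80".dup.force_encoding(Encoding::UTF_8)

puts packed == literal             # false: the encodings are incompatible
puts same_bytes?(packed, literal)  # true: the bytes are identical
```

On a single string, "\x80".b or force_encoding("ASCII-8BIT") gives the same effect, at the cost of declaring that the string is binary data rather than text.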

tessi answered Nov 17 '22 07:11


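In Ruby 1.9, the default source file encoding is US-ASCII; starting with Ruby 2.0, it is UTF-8. String literals normally take the encoding of the source file that contains them, so "\x80" is a UTF-8 string in Ruby 2.0. In a US-ASCII source file, a literal with a non-ASCII escape such as "\x80" is tagged ASCII-8BIT instead, because US-ASCII cannot represent that byte.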

However, the encoding of [128].pack("C") is ASCII-8BIT.

So [128].pack("C") == "\x80" is false in Ruby 2.0 but true in Ruby 1.9.

Putting #coding:some_encoding on the first line of a source file (or right after the shebang) changes the source encoding for that file.

#coding:ascii
puts([128].pack("C") == "\x80")

This outputs true in Ruby 2.0 as well.

Yu Hao answered Nov 17 '22 06:11