
Why are two strings with same bytes and encoding not identical in Ruby 1.9?

In Ruby 1.9.2, I found a way to make two strings that have the same bytes, same encoding, and are equal, but they have a different length and different characters returned by [].

Is this a bug? If it is not a bug, then I'd like to fully understand it. What kind of information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?

Below is the code that reproduces this behavior. The comments that start with #=> show you what output I am getting from this script, and the parenthetical words tell you my judgment of that output.

#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2"       # A well-behaved string with one character (¢)
string2 = "".concat(0xA2)  # A bizarre string very similar to string1.
p    string1.bytes.to_a    #=> [194, 162]  (good)
p    string2.bytes.to_a    #=> [194, 162]  (good)
puts string1.encoding.name #=> UTF-8  (good)
puts string2.encoding.name #=> UTF-8  (good)
puts string1 == string2    #=> true   (good)
puts string1.length        #=> 1      (good)
puts string2.length        #=> 2      (weird!)
p    string1[0]            #=> "¢"    (good)
p    string2[0]            #=> "\xC2" (weird!)

I am running Ubuntu and compiled Ruby from source. My Ruby version is:

ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
David Grayson asked Dec 22 '22 21:12



2 Answers

This is a bug in Ruby, and it was fixed in r29848.
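
To check whether a particular Ruby build still shows the problem, a small comparison script along these lines may help. It is only a sketch: the helper name string_report is made up for illustration, and the expectation that both reports match assumes a build that includes the fix. As for what hidden state could make the two strings differ, MRI keeps a cached "code range" flag on each String next to its bytes and encoding; that this cache was the culprit here is my assumption, not something the revision reference confirms.

```ruby
# coding: utf-8

# Hypothetical helper: gathers the observable properties that the
# question compares, so two strings can be checked side by side.
def string_report(str)
  {
    bytes:      str.bytes.to_a,
    encoding:   str.encoding.name,
    length:     str.length,
    first_char: str[0],
    valid:      str.valid_encoding?
  }
end

string1 = "\xC2\xA2"           # one character (¢) written as two UTF-8 bytes
string2 = "".dup.concat(0xA2)  # the same character appended by code point

report1 = string_report(string1)
report2 = string_report(string2)

p report1
p report2

# On a build with the fix, the two reports should be identical.
puts report1 == report2
```

If the last line prints false, the length and first_char entries show where the two strings diverge, which is exactly the symptom described in the question.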

naruse answered Jan 10 '23 19:01



Matz mentioned this question via Twitter:

http://twitter.com/matz_translator/status/6597021662187520

http://twitter.com/matz_translator/status/6597055132733440

"It's hard to determine as a bug but, it's not acceptable to leave it as is. I'd prefer to fix this issue."

Matt Aimonetti answered Jan 10 '23 18:01
