In Ruby 1.9.2, I found a way to make two strings that have the same bytes, same encoding, and are equal, but they have a different length and different characters returned by [].
Is this a bug? If it is not a bug, then I'd like to fully understand it. What kind of information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?
Below is the code that reproduces this behavior. The comments that start with #=> show you what output I am getting from this script, and the parenthetical words tell you my judgment of that output.
#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2" # A well-behaved string with one character (¢)
string2 = "".concat(0xA2) # A bizarre string very similar to string1.
p string1.bytes.to_a #=> [194, 162] (good)
p string2.bytes.to_a #=> [194, 162] (good)
puts string1.encoding.name #=> UTF-8 (good)
puts string2.encoding.name #=> UTF-8 (good)
puts string1 == string2 #=> true (good)
puts string1.length #=> 1 (good)
puts string2.length #=> 2 (weird!)
p string1[0] #=> "¢" (good)
p string2[0] #=> "\xC2" (weird!)
I am running Ubuntu and compiled Ruby from source. My Ruby version is:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
This is a bug in Ruby, and it was fixed in r29848.
Matz mentioned this question via Twitter:
http://twitter.com/matz_translator/status/6597021662187520
http://twitter.com/matz_translator/status/6597055132733440
"It's hard to determine as a bug but, it's not acceptable to leave it as is. I'd prefer to fix this issue."
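For anyone who wants to check whether their own build is affected, here is a minimal sketch. It compares a string's character-level view against a copy rebuilt from its raw bytes. The helper names rebuilt_from_bytes and consistent_view? are hypothetical, not part of Ruby. The idea rests on an assumption the thread does not confirm: that the odd behavior comes from stale per-object metadata (plausibly a cached code range) which a freshly built string would not carry.

```ruby
# coding: utf-8

# Hypothetical helper (not part of Ruby): build a new string from the raw
# bytes of +str+ and tag it with the same encoding. Assumption: a string
# built this way carries no stale cached metadata from the original object.
def rebuilt_from_bytes(str)
  str.bytes.to_a.pack("C*").force_encoding(str.encoding)
end

# Hypothetical helper: true when the string reports the same length and the
# same characters as a fresh copy built from its own bytes.
def consistent_view?(str)
  fresh = rebuilt_from_bytes(str)
  str.length == fresh.length && str.chars.to_a == fresh.chars.to_a
end

string1 = "\xC2\xA2"       # the well-behaved string from the question
string2 = "".concat(0xA2)  # the string that misbehaved on 1.9.2p0

p consistent_view?(string1)
p consistent_view?(string2)
```

On a Ruby that includes the fix, both calls should print true. Going by the output reported in the question, I would expect the second call to print false on 1.9.2p0, since string2.length was 2 there while a fresh copy of the same bytes has length 1. I have not verified that on an affected build.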