I recently started working with encoding in Ruby, and am confused by some behavior.
I'm using 2.2.3p173 and see the following:
__ENCODING__ #=> #<Encoding:UTF-8> (the default source encoding in 2.2.3)
"my_string".encoding #=> #<Encoding:UTF-8>
Object.to_s.encoding #=> #<Encoding:US-ASCII>
Object.new.to_s.encoding #=> #<Encoding:ASCII-8BIT>
What's the cause of this discrepancy in encodings?
Nice find!
The short answer is that it's essentially arbitrary: it depends on how Ruby internally builds the string being returned.
A whole host of internal C functions, rb_usascii_str_new and similar, construct empty or literal strings with US-ASCII encoding. They're frequently used to build a string by appending smaller fragments. Almost every to_s method does this:
[].to_s.encoding
#<Encoding:US-ASCII>
{}.to_s.encoding
#<Encoding:US-ASCII>
$/.to_s.encoding
#<Encoding:US-ASCII>
1.to_s.encoding
#<Encoding:US-ASCII>
true.to_s.encoding
#<Encoding:US-ASCII>
Object.to_s.encoding
#<Encoding:US-ASCII>
So why not Object.new.to_s? The key here is that Object#to_s is the fallback to_s method for every class, so to make it generic yet informative it was written to output the object's internal pointer value alongside the class name. The easiest way to do that is with sprintf and the %p specifier. But Ruby's sprintf wrapper, rb_sprintf, just sets the encoding to NULL, which falls back to ASCII-8BIT. So, generally, anything that returns a formatted string will have this encoding:
Object.new.to_s.encoding
#<Encoding:ASCII-8BIT>
nil.sort rescue $!.to_s.encoding
#<Encoding:ASCII-8BIT>
[].each.to_s.encoding
#<Encoding:ASCII-8BIT>
As for string literals defined in a script, those get the default source encoding, UTF-8, as you would expect.
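Since these encodings depend on interpreter internals, they may differ between Ruby versions. The following is a minimal sketch for checking what your own interpreter reports, and for confirming that ASCII-only strings in these encodings still combine cleanly with a UTF-8 string. The names samples, utf8 and combined are only illustrative and are not part of Ruby or of the question.

```ruby
# Sketch: report the encoding of several to_s results on the running interpreter.
# `samples`, `utf8` and `combined` are hypothetical names chosen for this example.
samples = {
  "1.to_s"          => 1.to_s,
  "true.to_s"       => true.to_s,
  "Object.to_s"     => Object.to_s,
  "Object.new.to_s" => Object.new.to_s,
}

# Print each expression with its encoding and whether it holds only 7-bit characters.
samples.each do |label, str|
  puts format("%-16s %-12s ascii_only=%s", label, str.encoding.name, str.ascii_only?)
end

# The \u escape forces this literal to be UTF-8, and the accent makes it non-ASCII.
utf8 = "caf\u00e9 "

# ASCII-only strings in an ASCII-compatible encoding can be appended to a UTF-8
# string without raising Encoding::CompatibilityError.
combined = samples.values.reduce(utf8) { |acc, str| acc + str }
puts combined.encoding
```

As long as the to_s output contains only 7-bit characters, the differing encoding labels should not get in the way of ordinary concatenation, because the result takes on the encoding of the non-ASCII operand.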