
String Encoding in Ruby

Tags: ruby, encoding

I recently started working with encoding in Ruby, and am confused by some behavior.

I'm using Ruby 2.2.3p173 and see the following:

__ENCODING__             #=> #<Encoding:UTF-8>  default source encoding in 2.2.3

"my_string".encoding     #=> #<Encoding:UTF-8>
Object.to_s.encoding     #=> #<Encoding:US-ASCII>
Object.new.to_s.encoding #=> #<Encoding:ASCII-8BIT>

What's the cause of this discrepancy in encodings?

garythegoat asked Nov 14 '15 02:11



1 Answer

Nice find!

The short answer is that it's essentially arbitrary: it depends on how Ruby internally builds the strings being returned.

There is a whole host of internal C functions, such as rb_usascii_str_new, that construct empty or literal strings with US-ASCII encoding. They're frequently used to build a string by appending smaller fragments to it. Almost every to_s method does this:

[].to_s.encoding
#<Encoding:US-ASCII>
{}.to_s.encoding
#<Encoding:US-ASCII>
$/.to_s.encoding
#<Encoding:US-ASCII>
1.to_s.encoding
#<Encoding:US-ASCII>
true.to_s.encoding
#<Encoding:US-ASCII>
Object.to_s.encoding
#<Encoding:US-ASCII>
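
A quick way to check this on your own Ruby build is to collect the encodings in one pass. This is only a sketch: the `samples` and `encodings` names are made up for illustration, and results may differ between Ruby versions, so no output is shown.

```ruby
# Hypothetical variable names; only core Ruby methods are used.
samples = {
  "1.to_s"      => 1.to_s,
  "true.to_s"   => true.to_s,
  "Object.to_s" => Object.to_s,
  "[].to_s"     => [].to_s,
  "literal"     => "my_string"
}

# Map each label to the Encoding object of the resulting string.
encodings = Hash[samples.map { |label, str| [label, str.encoding] }]

# Print one "label encoding" pair per line.
encodings.each do |label, enc|
  puts "#{label.ljust(12)} #{enc}"
end
```

Comparing the printed values against the list above shows which to_s methods on your version go through the US-ASCII constructors.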

So why not Object.new.to_s? The key here is that Object#to_s is the fallback to_s method for every class, so to make it generic yet informative it was written to output the value of the object's internal pointer. The easiest way to do that is with sprintf and the %p specifier. But Ruby's sprintf wrapper, rb_sprintf, simply sets the encoding to NULL, which falls back to ASCII-8BIT. So generally anything that returns a formatted string will have this encoding:

Object.new.to_s.encoding
#<Encoding:ASCII-8BIT>
nil.sort rescue $!.to_s.encoding
#<Encoding:ASCII-8BIT>
[].each.to_s.encoding
#<Encoding:ASCII-8BIT>
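
The mismatch tends to be harmless when the bytes are plain ASCII, because an ASCII-8BIT string holding only 7-bit bytes is compatible with UTF-8. The sketch below uses `String#b` to build a stand-in for a formatted string (an assumption, so the example does not depend on how a particular Ruby version labels `Object#to_s`), then shows concatenation and relabeling with `force_encoding`. The variable names are illustrative.

```ruby
# Stand-in for a formatted string such as the output of Object#to_s.
# String#b returns a copy tagged as ASCII-8BIT.
binary = "#<Object:0x007f8a>".b

# The \u escape makes this literal UTF-8 with one non-ASCII character.
utf8 = "caf\u00e9 "

# Returns the encoding both strings can share, or nil if there is none.
compatible = Encoding.compatible?(utf8, binary)

# Concatenation succeeds because `binary` contains only 7-bit bytes.
combined = utf8 + binary

# force_encoding changes the label only; the bytes are left untouched.
relabeled = binary.dup.force_encoding(Encoding::UTF_8)

puts compatible
puts combined.encoding
puts relabeled.bytes == binary.bytes
```

If the binary string contained bytes above 0x7F, the concatenation would raise Encoding::CompatibilityError instead, which is where the encoding label starts to matter.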

As for string literals defined in a script, those get the source encoding, which defaults to UTF-8 (since Ruby 2.0), as you would expect.
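
As a small check of that last point, a plain literal should report the same encoding as `__ENCODING__`, the source encoding of the code being parsed. This sketch assumes the literal contains no `\u` escapes, which can force UTF-8 regardless of the source encoding; the `literal` name is illustrative.

```ruby
# A plain literal with no escapes takes the source encoding.
literal = "my_string"

puts __ENCODING__
puts literal.encoding
puts literal.encoding == __ENCODING__
```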

Max answered Nov 02 '22 19:11
