
Ruby character encoding when using Base64.encode64

Looking at the source of Ruby's Base64.encode64, I cannot determine what character encoding a string is converted to, if any, before its data is encoded in Base64. A UTF-8 string encoded in Base64 is going to look very different from a UTF-16 string encoded in Base64. Does Ruby make any promises regarding this operation?

Brent asked May 16 '13 19:05



2 Answers

An example of encoding and decoding a UTF-8 string in Base64:

require 'base64'
 => true
text = "intérnalionálização"
 => "intérnalionálização"
text.encoding
 => #<Encoding:UTF-8>
encoded = Base64.encode64(text)
 => "aW50w6lybmFsaW9uw6FsaXphw6fDo28=\n"
encoded.encoding
 => #<Encoding:US-ASCII>
decoded = Base64.decode64(encoded)
 => "int\xC3\xA9rnalion\xC3\xA1liza\xC3\xA7\xC3\xA3o"
decoded.encoding
 => #<Encoding:ASCII-8BIT>
decoded = decoded.force_encoding('UTF-8')
 => "intérnalionálização"
decoded.encoding
 => #<Encoding:UTF-8>
Victor Lellis answered Oct 24 '22 15:10


The fine manual has this to say:

encode64(bin)
Returns the Base64-encoded version of bin. This method complies with RFC 2045.

Section 6.8 of RFC 2045 says:

6.8. Base64 Content-Transfer-Encoding

The Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable. [...]

A 65-character subset of US-ASCII is used, enabling 6 bits to be represented per printable character. (The extra 65th character, "=", is used to signify a special processing function.)

So Base64 encodes bytes into ASCII. If those bytes happen to be a UTF-8 encoded string, the string is broken down into its individual bytes and those bytes are converted to Base64. For example, the UTF-8 string 'µ' is the bytes 0xc2 and 0xb5 (in that order), and they encode to the Base64 representation "wrU=\n". If you start with the binary string "\xc2\xb5" (which just happens to match the UTF-8 version of 'µ'), you get the same "wrU=\n" output.

When you decode "wrU=\n", you get the bytes "\xc2\xb5" back, and you have to know that those bytes are supposed to be UTF-8 encoded text rather than some arbitrary blob of bits. This is why separate content type and character set metadata is attached to the Base64.
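As a rough illustration of the two paragraphs above, the following sketch encodes the same two bytes twice, once labelled as UTF-8 text and once as a binary blob, and then decodes the result. It assumes a UTF-8 source file (the default in current Ruby) and that the base64 library is available; the variable names are only for illustration.

```ruby
require 'base64'

text = "µ"              # UTF-8 string, bytes C2 B5
blob = "\xC2\xB5".b     # the same two bytes, labelled as binary

from_text = Base64.encode64(text)
from_blob = Base64.encode64(blob)
p from_text             # "wrU=\n"
p from_blob == from_text

# Decoding hands back raw bytes; the original UTF-8 label is gone.
decoded = Base64.decode64(from_text)
p decoded.bytes         # [194, 181]
p decoded.encoding      # binary (ASCII-8BIT)

# The caller has to know the bytes are UTF-8 and relabel them.
restored = decoded.dup.force_encoding(Encoding::UTF_8)
p restored == text
```

Note that force_encoding only changes the label on the string; it does not convert or validate the bytes.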

Similarly, if you have a UTF-16 string then it will be broken into bytes and those bytes will be encoded just like any other byte string. Of course, this case is a little more complicated due to byte order issues, but that's why we have content type and character set headers and BOMs.
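To make the byte order point concrete, here is a small sketch that transcodes 'µ' to both UTF-16 byte orders before encoding. The variable names are illustrative, and the explicit UTF-16LE/UTF-16BE encodings are used here so that no BOM is involved.

```ruby
require 'base64'

text = "µ"                                  # U+00B5

utf16le = text.encode(Encoding::UTF_16LE)   # bytes B5 00
utf16be = text.encode(Encoding::UTF_16BE)   # bytes 00 B5

le64 = Base64.encode64(utf16le)
be64 = Base64.encode64(utf16be)
p le64    # "tQA=\n"
p be64    # "ALU=\n"

# Decoding only returns bytes; picking the wrong byte order silently
# yields a different character.
right = Base64.decode64(le64).force_encoding(Encoding::UTF_16LE)
wrong = Base64.decode64(le64).force_encoding(Encoding::UTF_16BE)
p right.encode(Encoding::UTF_8) == text
p wrong.encode(Encoding::UTF_8) == text
```

Neither output matches the "wrU=\n" produced from the UTF-8 bytes, even though all three strings represent the same character.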

The main point is that Base64 works with bytes, not characters. What those bytes represent (UTF-8 text, UTF-16 text, a PNG image, ...) is someone else's problem. Base64 just converts a byte stream to a subset of US-ASCII and then back to bytes; the format of those bytes must be specified separately.
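One way to specify the format separately is to ship the character set name next to the Base64 payload. The sketch below is only an assumption about how that might look: wrap and unwrap are hypothetical helpers, not part of Ruby's Base64 module, and the hash layout is made up for illustration.

```ruby
require 'base64'

# Hypothetical helper: pair the Base64 payload with the name of the
# encoding the original string was labelled with.
def wrap(text)
  { charset: text.encoding.name, data: Base64.encode64(text) }
end

# Hypothetical helper: decode to bytes, then relabel them using the
# charset that travelled alongside the payload.
def unwrap(envelope)
  Base64.decode64(envelope[:data]).force_encoding(envelope[:charset])
end

utf8_copy  = unwrap(wrap("µ"))
utf16_text = "µ".encode(Encoding::UTF_16LE)
utf16_copy = unwrap(wrap(utf16_text))

p utf8_copy.encoding    # UTF-8
p utf16_copy.encoding   # UTF-16LE
```

This plays the same role as the content type and character set headers in MIME: the Base64 data itself never records what the bytes mean.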


I did some poking around in the source and the results might be of interest even if they're not completely relevant. The encode64 method is simply this:

def encode64(bin)
  [bin].pack("m")
end

Then if you look through Array#pack:

static VALUE
pack_pack(VALUE ary, VALUE fmt)
{
    /*...*/
    int enc_info = 1;       /* 0 - BINARY, 1 - US-ASCII, 2 - UTF-8 */

and keep an eye on enc_info, you'll see that the 'm' format leaves enc_info alone, so the packed string comes out as US-ASCII and encode64 produces US-ASCII output, as expected.
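This can be checked from Ruby without reading the C source. The quick sketch below packs strings carrying three different encoding labels and prints the encoding of each result; the inputs hash is just an illustrative set of samples.

```ruby
# Sample inputs with different encoding labels (illustrative only).
inputs = {
  "UTF-8"    => "µ",
  "UTF-16LE" => "µ".encode(Encoding::UTF_16LE),
  "binary"   => "\xC2\xB5".b
}

# Pack each input with the "m" (Base64) directive, as encode64 does.
packed = inputs.map { |label, str| [label, [str].pack("m")] }.to_h

packed.each do |label, out|
  puts "#{label}: #{out.inspect} (#{out.encoding})"
end
```

Whatever label the input carries, the packed output is tagged US-ASCII; only its bytes differ.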

mu is too short answered Oct 24 '22 15:10