
Encoding MessagePack objects containing Node.js Buffers

I'm using node-msgpack to encode and decode messages passed around between machines. One thing I'd like to be able to do is wrap raw Buffer data in an object and encode that with MessagePack.

msgpack = require('msgpack')
buf  // => <Buffer 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 ...>
obj = {foo: buf}
packed = msgpack.pack(obj)

In the example above, I wanted to do a consistency check on the raw bytes of buffers nested in an object. So buf was obtained like so:

var buf = fs.readFileSync('some_image.png');

In a perfect world, I would have obtained:

new Buffer(msgpack.unpack(packed).foo);

#> <Buffer 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 ...>

Instead, I end up with some random number. Digging a little deeper, I came across the following curiosity:

enc = 'ascii'
new Buffer(buf.toString(enc), enc)
#> <Buffer *ef bf bd* 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 ...>

buf
#> <Buffer *89* 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 02 00 ...>

The first byte is the problem. I tried using different encodings with no luck. What is happening here, and what can I do to circumvent the issue?

EDIT:

Originally, buf was a buffer I had generated with msgpack itself, thus double-packing data. To avoid any confusion, I replaced it with a buffer obtained by reading an image, which exhibits the same problem.

asked Dec 18 '12 by matehat

1 Answer

Buffer corruption occurs when binary data is decoded using any text encoding except base64 and hex, and node-msgpack doesn't appear to pick those up. It automatically falls back to 'utf-8', which irreversibly mangles the buffer. The library presumably does this so that decoded values come back as ordinary strings rather than a pile of Buffer objects, since strings are what most msgpack payloads are made of.


EDIT:

The three bytes shown above in place of the original first byte represent the UTF-8 replacement character (U+FFFD). A quick test shows that this character replaced the 0x89 byte, which is not valid at the start of a UTF-8 sequence:

new Buffer((new Buffer('89', 'hex')).toString('utf-8'), 'utf-8')
//> <Buffer ef bf bd>

This line of C++ code from node-msgpack is responsible for this behavior. When it encounters a Buffer instance in a data structure given to the encoder, it blindly converts it to a String, equivalent to calling buffer.toString(), which assumes UTF-8 encoding by default and replaces every unrecognizable byte sequence with the replacement character above.
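
If that's accurate, it can be checked directly: packing a Buffer should produce exactly the same bytes as packing its default string conversion. A minimal sketch (the expected result is an assumption based on the explanation above):

var msgpack = require('msgpack');

// 0x89 cannot start a valid UTF-8 sequence
var buf = new Buffer([0x89, 0x50, 0x4e, 0x47]);

// If the encoder blindly stringifies buffers, these two should be identical
msgpack.pack(buf).toString('hex') === msgpack.pack(buf.toString()).toString('hex');
//> true (expected)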

The alternative module suggested below works around this by leaving the buffer as raw bytes instead of trying to convert it to a string, but in doing so becomes incompatible with other MessagePack implementations. If compatibility is a concern, a workaround is to encode non-UTF-8 buffers ahead of time with a binary-safe encoding like binary, base64 or hex, as sketched below. base64 and hex will inevitably grow the data by a significant amount, but they leave it consistent and are the safest to use when transporting data over HTTP. If size is a concern as well, piping the MessagePack result through a streaming compression algorithm like Snappy can be a good option.
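
For instance, a minimal sketch of the base64 workaround (the field name foo and the image path are just placeholders):

var msgpack = require('msgpack');
var fs = require('fs');

var buf = fs.readFileSync('some_image.png');

// Pre-encode the binary field so node-msgpack only ever sees plain ASCII
var packed = msgpack.pack({foo: buf.toString('base64')});

// ...send packed across the wire...

var unpacked = msgpack.unpack(packed);
var restored = new Buffer(unpacked.foo, 'base64');
// restored now holds the same bytes as buf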


Turns out another module, msgpack-js (a msgpack encoder/decoder written entirely in JavaScript), leaves raw binary data as such, hence solving the above problem. Here's how its author describes it:

I've extended the format a little to allow for encoding and decoding of undefined and Buffer instances.

This required three new type codes that were previously marked as "reserved". This change means that using these new types will render your serialized data incompatible with other messagepack implementations that don't have the same extension.
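
Swapping it in looks roughly like this (a sketch, assuming msgpack-js exposes encode and decode, as its README suggests):

var msgpack = require('msgpack-js');
var fs = require('fs');

var buf = fs.readFileSync('some_image.png');

var packed = msgpack.encode({foo: buf});
var unpacked = msgpack.decode(packed);
// unpacked.foo comes back as a Buffer with the original bytes intact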

As a bonus, it's also more performant than the C++ extension-based module mentioned earlier. It's also much younger, so perhaps not as thoroughly tested; time will tell. Here is the result of a quick benchmark I ran, based on the one included in node-msgpack, comparing the two libraries (as well as the native JSON functions):

node-msgpack pack:   3793 ms
node-msgpack unpack: 1340 ms

msgpack-js pack:   3132 ms
msgpack-js unpack: 983 ms

json pack:   1223 ms
json unpack: 483 ms

So while the pure-JavaScript msgpack implementation shows a performance improvement over the C++ one, JSON is still way more performant.

answered by matehat