I am working on a data system that needs to store large amounts of simple, extensible data (alongside some specialist indexing we are developing in-house, which is not part of this question). I expect there to be billions of records stored, so efficient serialisation is a key part of the system. The serialisation needs to be fast, space-efficient, and supported on multiple platforms and in multiple languages (because packing and unpacking this data will be a client component responsibility, not part of the storage system).
The data type is effectively a hash with optional key/value pairs. Keys will be small integers (interpreted at application layer). Values can be a variety of simple data types - String, Integer, Float.
As a technology choice, we have picked MessagePack, and I am writing code to perform data serialisation via Ruby's msgpack-ruby gem.
I don't need the precision of Ruby's 64-bit Float. None of the numbers being stored has meaningful precision even to limits of 32-bit. So I want to use MessagePack support for 32-bit floating point values. This definitely exists. However, the default behaviour in Ruby on any 64-bit system is to serialise Float to 64 bits:
MessagePack.pack(10.3)
=> "\xCB@$\x99\x99\x99\x99\x99\x9A"
Looking at the MessagePack code, it seems there is a method MessagePack::Packer#write_float32, and this does what I expect:
MessagePack::DefaultFactory.packer.write_float32(10.3).to_s
=> "\xCAA$\xCC\xCD"
. . . but I cannot find a way to set up either the default packer or create a new one, that will use this method when serialising a larger structure.
As a test of my comprehension, I tried this:
class Float
  def to_msgpack_ext
    packer.write_float32(self)
  end

  def self.from_msgpack_ext s
    unpacker.read(s)
  end
end
MessagePack::DefaultFactory.register_type(0, Float )
MessagePack.pack(10.3)
=> "\xCB@$\x99\x99\x99\x99\x99\x9A"
No difference at all . . . clearly I am missing or misunderstanding something about the object model used in MessagePack. Is what I want to do possible, and what do I need to do?
I know it would be nice to use MessagePack.pack, but the Ruby shim is very thin. It barely gives you an entry point into the C (or Java) library. And as AnoE pointed out, I think you can only customize to_msgpack_ext and self.from_msgpack_ext for registered types, not built-in types.
The other problem with your attempt is that you don't have access to packer and unpacker from those methods. You would just have to use Array#pack and String#unpack, I think, even if you could figure out a way to get the library to call your methods. To get a handle to packer, you have to override a different method:
class Float
  private

  def to_msgpack_with_packer(packer)
    packer.write_float32 self
    packer
  end
end
And then call it appropriately (see this code as to why):
10.3.to_msgpack(MessagePack::Packer.new).to_s # => "\xCAA$\xCC\xCD"
However, this falls apart when you call #to_msgpack on a Hash containing a float; it just reverts to its internal methods to pack hash keys and values. This is why I said above that the Ruby shim just gives you an entry point: the core extensions are only used for the initial call.
I think the best, simplest solution is to write a little serialization function that iterates through the hash in Ruby, using the MessagePack::Packer API to do what you want when it sees a float, etc. Zero C-hacking, zero monkey-patching, zero confusion when someone tries to read your code in six months.
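def pack_float32(obj, packer=MessagePack::Packer.new)
  case obj
  when Hash
    # Write the map header, then each key and value recursively.
    packer.write_map_header(obj.size)
    obj.each_pair do |key, value|
      pack_float32(value, pack_float32(key, packer))
    end
  when Enumerable
    # Treat any other collection as an array (it must respond to #size).
    packer.write_array_header(obj.size)
    obj.each do |value|
      pack_float32(value, packer)
    end
  when Float
    # The one special case: force the 32-bit float format.
    packer.write_float32(obj)
  else
    # Everything else uses the library's default encoding.
    packer.write(obj)
  end
  packer
end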
pack_float32(1=>[10.3]).to_s # => "\x81\x01\x91\xCAA$\xCC\xCD"
Obviously this is not strenuously tested, and it may not handle all the edge cases, but hopefully it's enough to get you started.
One other note: You don't have to worry about unpacking. msgpack-ruby appears to correctly unpack a 32-bit float to a 64-bit Float without any fiddling on our part.
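To make that last point concrete, here is a minimal sketch that decodes a float32 value by hand, using only the standard library (no msgpack gem involved). The helper name decode_float32 is made up for this illustration; the 0xCA type byte and the big-endian single-precision payload come from the MessagePack format. It shows that the decoded value is an ordinary Ruby Float, just carrying 32-bit precision.

```ruby
# Hypothetical helper, not part of msgpack-ruby: decode one MessagePack
# float32 value (type byte 0xCA followed by 4 big-endian IEEE 754 bytes).
def decode_float32(bytes)
  bytes = bytes.b
  unless bytes.bytesize == 5 && bytes.getbyte(0) == 0xCA
    raise ArgumentError, "not a MessagePack float32 value"
  end
  # "g" is Array#pack / String#unpack's big-endian single-precision float.
  bytes.byteslice(1, 4).unpack1("g")
end

# Build the same bytes that write_float32 produced in the question.
packed = "\xCA".b + [10.3].pack("g")
value  = decode_float32(packed)

puts value.class          # the result is a regular (64-bit) Float
puts (value - 10.3).abs   # small but non-zero: precision was lost on packing
```

So the value comes back as a Float, but it is the nearest 32-bit approximation of 10.3 rather than 10.3 itself, which is worth keeping in mind if you ever compare unpacked values for equality.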
As of right now (version 1.2.4 of msgpack-ruby) this is not possible in the exact fashion you tried: the msgpack_packer_write_value function first checks all hard-coded data types, and handles them with its default implementation. Only if the current object does not fit any of those types are the extensions handled.
In other words: you cannot override the default pack formats with MessagePack::DefaultFactory#register_type; calling it will simply be a no-op.
Furthermore, the extension mechanism is not what you are looking for anyway. Using it, MessagePack would emit a marker byte meaning "this is an extension", followed by the extension ID (the value "0" in your example), followed by whatever your method returned, here the already-encoded float32. Alternatively, you would need to handle the binary encoding/decoding yourself.
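As a rough illustration of that overhead, the sketch below builds the three byte layouts by hand with Array#pack, without using msgpack-ruby at all. The helper names are invented for this example, and the extension case assumes you hand-encode the 4 raw float bytes into a "fixext 4" wrapper (type byte 0xD6); it is not what the gem would emit for you.

```ruby
# Illustrative byte layouts only, built by hand rather than by msgpack-ruby.
FLOAT64 = "\xCB".b  # MessagePack "float 64" type byte
FLOAT32 = "\xCA".b  # MessagePack "float 32" type byte
FIXEXT4 = "\xD6".b  # MessagePack "fixext 4" type byte (4-byte ext payload)

# Hypothetical helpers for this sketch.
def as_float64(x)
  FLOAT64 + [x].pack("G")   # 1 type byte + 8-byte big-endian double
end

def as_float32(x)
  FLOAT32 + [x].pack("g")   # 1 type byte + 4-byte big-endian single
end

def as_ext_float32(x, type_id = 0)
  # 1 type byte + 1 signed extension-id byte + 4 payload bytes
  FIXEXT4 + [type_id].pack("c") + [x].pack("g")
end

sizes = {
  float64: as_float64(10.3).bytesize,
  float32: as_float32(10.3).bytesize,
  ext:     as_ext_float32(10.3).bytesize,
}
p sizes
```

Even in the best case, the extension route costs one byte more per value than the native float32 format, and every reader in every language would need to know what extension ID 0 means.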
You could, in principle, create your own FloatX class or whatever, but this is just a very bad move: Float has no new method which you could monkeypatch, and I know of no way to tell Ruby to create a FloatX instance when you write 10.3 in your code. So you would have to do manual object creation throughout your code, probably with severe impact on performance.
You would need to override the msgpack_packer_write_value implementation in packer.c. Unfortunately, you cannot do that from the Ruby side, since there is no equivalent Ruby method defined for it. So the usual monkeypatching cannot be used.
Also, the function is called from plenty of other places inside the packer.c implementation, for example in the functions responsible for writing arrays or hashes. Those of course would not call a Ruby method of the same name either, as they live entirely in their binary world.
Finally, while the usage of a factory mechanism seems to imply that you can somehow create different implementations of packers, I see no evidence that this is actually true. Reading the C code of the gem, there seems to be no provision for anything of that kind. The factory seems to be there to handle the Ruby<->C interactions of the gem.
If I were in your shoes, I would clone that gem and modify msgpack_packer_write_value in packer.c to behave as you wish. Check the case T_FLOAT and go on from there. The code seems pretty straightforward: it soon proceeds to the following function in packer.h:
static inline void msgpack_packer_write_float_value(msgpack_packer_t* pk, VALUE v)
{
    msgpack_packer_write_double(pk, rb_num2dbl(v));
}
...which is of course the real culprit here.
Approaching that from the other direction (the write_float32 you already found), the comparable code is:
msgpack_packer_write_float(pk, (float)rb_num2dbl(numeric));
So if you replace the line in msgpack_packer_write_float_value with this (using v in place of numeric), you will be done. Should be doable even if you're not that much into C.
Afterwards, you give your gem an individual release tag, build it yourself and specify it in your Gemfile or however you manage your gems.