
Protobuf checksum (crc)

I am going to store some big objects in a database as BLOBs. Protobuf is, as I see it, one of the best candidates for serializing/deserializing a BLOB. Although it is a binary format, its content (strings, integers, etc.) is still easy to read and change. So I need some kind of data validation to tell whether a BLOB is the original or has been modified (by a hacker? by a too-smart user?).

One possibility would be to have a dedicated field in the table, call it crc, calculate the checksum of the BLOB and put it there. But it would be much better (in many scenarios) if the crc were part of the BLOB itself.

I can add extra bytes to the end of the protobuf stream, but I will have to remove them before deserializing (or the deserializer will throw an exception such as "invalid field blablabla").

I can put the protobuf stream into a wrapper, but that again adds overhead to unwrap/wrap.

Is there an easy and cheap way to add something to the end of a protobuf stream without needing additional operations during deserialization? In XML, I could add a comment. I don't think there are comments in protobuf, so how can I store a CRC of, for example, 1 or 2 bytes?

Sinatr asked Apr 01 '14 09:04



2 Answers

Protobuf streams are appendable. If you know a field number that doesn't exist in the data, you can simply append data against that field. If you are intending to add 1 or 2 bytes of CRC data, then a "varint" is probably your best bet (note that "varint" is a 7-bit encoding format with the 8th bit as a continuation marker, so you probably want to use 7, 14 or 21 bits of actual CRC data). Then you can just append:

  • the chosen field number, left-shifted 3 bits (the low 3 bits hold the wire type, which is 0 for a varint), then varint encoded
  • the CRC data, varint encoded
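
As a rough illustration of the two appended items above, here is a minimal sketch that works on the raw serialized bytes. The class and method names (`ProtoAppend`, `WriteVarint`, `AppendCrcField`) and the field number 15 used in the usage note are hypothetical; the only requirement is that the field number is not used by the message definition.

```csharp
using System;
using System.Collections.Generic;

public static class ProtoAppend
{
    // Encodes value as a protobuf varint: 7 data bits per byte,
    // with the high bit set on every byte except the last.
    public static void WriteVarint(List<byte> output, ulong value)
    {
        while (value >= 0x80)
        {
            output.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }

    // Returns the payload followed by one extra varint field holding the CRC.
    // fieldNumber is an assumption: it must be unused in the message definition.
    public static byte[] AppendCrcField(byte[] payload, int fieldNumber, uint crc)
    {
        var output = new List<byte>(payload);
        WriteVarint(output, (ulong)fieldNumber << 3); // key: field number, wire type 0
        WriteVarint(output, crc);                     // the CRC value itself
        return output.ToArray();
    }
}
```

For example, appending a 14-bit CRC value of 0x1234 as field 15 to the payload `08 96 01` adds the three bytes `78 B4 24`: one key byte and two varint bytes.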

However! The wrinkle in this is that the decoder will still often interpret and store this data as an unknown field, meaning that if you re-serialize the object, the output will include this now stale data.

The other approach, which avoids this, would be to encapsulate the protobuf data in some framing mechanism of your own devising. For example, you could choose to do:

  • 4 bytes to represent the protobuf payload length, "n"
  • "n" bytes of the protobuf payload
  • 2 bytes of CRC data calculated over the "n" bytes

I'd probably go with the second option. Note that you could choose "varint" encoding rather than fixed length encoding for the length prefix if you want. Probably not worth it for the CRC, though, since that will be fixed length.
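
A minimal sketch of that framing layout follows, assuming a fixed 4-byte little-endian length prefix and CRC-16/CCITT-FALSE as the 2-byte checksum. The answer does not name a CRC variant, so that choice, like the names `CrcFrame`, `Wrap` and `Unwrap`, is only an illustration.

```csharp
using System;
using System.IO;

public static class CrcFrame
{
    // CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF).
    // Chosen for illustration; any 16-bit CRC would fit the layout.
    public static ushort Crc16(byte[] data)
    {
        ushort crc = 0xFFFF;
        foreach (byte b in data)
        {
            crc ^= (ushort)(b << 8);
            for (int i = 0; i < 8; i++)
            {
                crc = (crc & 0x8000) != 0
                    ? unchecked((ushort)((crc << 1) ^ 0x1021))
                    : unchecked((ushort)(crc << 1));
            }
        }
        return crc;
    }

    // Layout: [4-byte length n][n payload bytes][2-byte CRC over the payload].
    public static byte[] Wrap(byte[] payload)
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            writer.Write(payload.Length);  // int, 4 bytes, little-endian
            writer.Write(payload);
            writer.Write(Crc16(payload));  // ushort, 2 bytes
            writer.Flush();
            return ms.ToArray();
        }
    }

    // Returns the payload, or throws if the frame is malformed or was modified.
    public static byte[] Unwrap(byte[] frame)
    {
        if (frame.Length < 6)
            throw new InvalidDataException("Frame too short");
        using (var reader = new BinaryReader(new MemoryStream(frame)))
        {
            int n = reader.ReadInt32();
            if (n < 0 || n > frame.Length - 6)
                throw new InvalidDataException("Invalid length prefix");
            byte[] payload = reader.ReadBytes(n);
            ushort stored = reader.ReadUInt16();
            if (stored != Crc16(payload))
                throw new InvalidDataException("CRC mismatch");
            return payload;
        }
    }
}
```

The bytes returned by `Unwrap` can then be handed to the deserializer unchanged, so the CRC never ends up inside the object and is not written back when the object is serialized again.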

Marc Gravell answered Oct 19 '22 19:10


The CRC should be stored before the payload. This makes deserialization from a stream trivial: use Seek to skip the header.

Here is the simplest implementation:

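// serialize
using (var file = File.Create("test.bin"))
using (var mem = new MemoryStream())
{
    Serializer.Serialize(mem, obj); // serialize obj into memory first
    // ... calculate crc (a single byte) over the contents of mem
    file.Write(new byte[] { crc }, 0, 1); // header: the CRC byte comes first
    mem.WriteTo(file);                    // followed by the protobuf payload
}

// deserialize
using (var file = File.OpenRead("test.bin"))
{
    var crc = file.ReadByte(); // stored CRC (-1 if the file is empty)
    // ... calculate crc over the rest of the file and compare it with the stored value
    file.Seek(1, SeekOrigin.Begin); // rewind to the start of the payload, just past the CRC byte
    var obj = Serializer.Deserialize<ObjType>(file);
}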
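
The snippet above leaves the checksum calculation open. One possible way to fill in the `// ... calculate crc` steps is a one-byte CRC-8; the sketch below assumes polynomial 0x07 with an initial value of 0, and the names `Crc8Prefix`, `AddPrefix` and `OpenVerified` are hypothetical. It works on byte arrays rather than files to stay self-contained.

```csharp
using System;
using System.IO;

public static class Crc8Prefix
{
    // CRC-8 with polynomial 0x07 and initial value 0 (an illustrative choice).
    public static byte Crc8(byte[] data, int offset, int count)
    {
        byte crc = 0;
        for (int i = offset; i < offset + count; i++)
        {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++)
            {
                crc = (crc & 0x80) != 0
                    ? unchecked((byte)((crc << 1) ^ 0x07))
                    : unchecked((byte)(crc << 1));
            }
        }
        return crc;
    }

    // Returns [CRC byte][payload], matching the "CRC first" layout.
    public static byte[] AddPrefix(byte[] payload)
    {
        var result = new byte[payload.Length + 1];
        result[0] = Crc8(payload, 0, payload.Length);
        Buffer.BlockCopy(payload, 0, result, 1, payload.Length);
        return result;
    }

    // Verifies the CRC byte and returns a read-only stream over the payload only.
    public static MemoryStream OpenVerified(byte[] stored)
    {
        if (stored.Length == 0 || stored[0] != Crc8(stored, 1, stored.Length - 1))
            throw new InvalidDataException("CRC mismatch");
        return new MemoryStream(stored, 1, stored.Length - 1, writable: false);
    }
}
```

The stream returned by `OpenVerified` starts at the first payload byte, so it can be passed to `Serializer.Deserialize<ObjType>` without a separate Seek. A single byte only catches accidental corruption with limited reliability and offers no protection against deliberate modification, since anyone editing the BLOB can recompute it.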
Sinatr answered Oct 19 '22 19:10