
Why is protobuf bad for large data structures?


I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients. I'm trying to apply protobuf because:

  • It is language agnostic, has generators both for C++ and Python
  • It is binary. I can't afford text formats because my data structure is quite large

But the protobuf user guide says:

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

https://developers.google.com/protocol-buffers/docs/techniques#large-data

I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.

Why is protobuf bad for serializing large datasets? What should I use instead?

asked Nov 30 '17 by random

People also ask

How large can a protobuf be?

Protobuf has a hard limit of 2GB, because many implementations use 32-bit signed arithmetic. For security reasons, many implementations (especially the Google-provided ones) impose a size limit of 64MB by default, although you can increase this limit manually if you need to.
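If your messages come anywhere near those limits, one defensive habit (a sketch only, assuming a hypothetical generated class graph_pb2.Graph) is to measure the serialized size before shipping a message and split it when it gets too big:

    import graph_pb2  # hypothetical generated module containing a Graph message

    MAX_MESSAGE_BYTES = 64 * 1024 * 1024  # the default limit mentioned above

    def serialize_checked(graph):
        # ByteSize() computes the serialized length without producing the bytes
        size = graph.ByteSize()
        if size > MAX_MESSAGE_BYTES:
            raise ValueError(
                "Graph serializes to %d bytes; split it into smaller messages "
                "or raise the receiver's parse limit" % size)
        return graph.SerializeToString()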

Does protobuf compress data?

No, it does not; there is no "compression" as such specified in the protobuf spec. However, it does (by default) use "varint encoding", a variable-length encoding for integer data in which small values use less space; values 0-127 take one byte plus the field header.
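For illustration, here is a minimal pure-Python sketch of the varint scheme (not the protobuf library itself, just the same wire encoding it uses for integers):

    def encode_varint(value):
        # 7 payload bits per byte; the high bit marks "more bytes follow".
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    assert len(encode_varint(127)) == 1       # small values fit in a single byte
    assert len(encode_varint(128)) == 2
    assert encode_varint(300) == b"\xac\x02"  # 300 needs two bytes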

Is protobuf more efficient than JSON?

JSON is usually easier to debug (the serialized format is human-readable) and easier to work with (no need to define message types, compile them, install additional libraries, etc.). Protobuf, on the other hand, usually produces much smaller output and has built-in protocol documentation via the schema.

What is faster than protobuf?

Cap'n Proto is an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap'n Proto is INFINITY TIMES faster than Protocol Buffers. (The "infinity times" line is the project's own joke: there is no encoding/decoding step at all, because the wire format is used directly as the in-memory format.)


2 Answers

It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.

However, OpenStreetMap does not store the entire map as a single message. Instead the file consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
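One way to frame such a stream of messages in a single file (a sketch, not the actual OpenStreetMap PBF layout, and assuming a hypothetical generated class graph_pb2.Node) is to length-prefix each serialized message:

    import struct
    import graph_pb2  # hypothetical generated module containing a Node message

    def write_nodes(path, nodes):
        # Each record: 4-byte little-endian length prefix, then the message bytes.
        with open(path, "wb") as f:
            for node in nodes:
                data = node.SerializeToString()
                f.write(struct.pack("<I", len(data)))
                f.write(data)

    def read_nodes(path):
        # Yields one Node at a time, so the whole graph never has to fit in memory.
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if not header:
                    break
                (size,) = struct.unpack("<I", header)
                node = graph_pb2.Node()
                node.ParseFromString(f.read(size))
                yield node

The length prefix is not part of protobuf itself, so the C++ side only has to read the same fixed-width framing before handing each chunk to the parser.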

The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.

If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
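If you want random access but still like protobuf's schema, one hybrid (sketched here with Python's built-in sqlite3 and the same hypothetical graph_pb2.Node) is to store each serialized message as a blob keyed by node id:

    import sqlite3
    import graph_pb2  # hypothetical generated module, as above

    conn = sqlite3.connect("graph.db")
    conn.execute("CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, blob BLOB)")

    def put_node(node_id, node):
        conn.execute("INSERT OR REPLACE INTO nodes VALUES (?, ?)",
                     (node_id, node.SerializeToString()))
        conn.commit()

    def get_node(node_id):
        # Only the requested node is read and parsed, not the whole graph.
        row = conn.execute("SELECT blob FROM nodes WHERE id = ?", (node_id,)).fetchone()
        node = graph_pb2.Node()
        node.ParseFromString(row[0])
        return node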

answered Sep 30 '22 by jpa

It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.

The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.

If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.

answered Sep 30 '22 by Ken Bloom