 

Google Protocol Buffers Serialization hangs writing 1GB+ data

I am serializing a large data set using protocol buffer serialization. When my data set contains 400,000 custom objects with a combined size of around 1 GB, serialization completes in 3-4 seconds. But when the data set contains 450,000 objects with a combined size of around 1.2 GB, the serialization call never returns and the CPU is constantly consumed.

I am using the .NET port of Protocol Buffers (protobuf-net).
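Roughly, the failing call looks like the following (a simplified sketch; MyRecord stands in for my actual custom type):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class MyRecord    // illustrative stand-in for the real custom type
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public byte[] Payload { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var records = new List<MyRecord>(); // populated with ~450,000 items elsewhere
        using (var ms = new MemoryStream())
        {
            // Completes in 3-4 seconds at ~1 GB, but never returns at ~1.2 GB.
            Serializer.Serialize(ms, records);
        }
    }
}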

asked Jun 15 '11 by muddxr

People also ask

How big can a Protobuf message be?

Protobuf has a hard limit of 2GB, because many implementations use 32-bit signed arithmetic. For security reasons, many implementations (especially the Google-provided ones) impose a size limit of 64MB by default, although you can increase this limit manually if you need to.
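For example, with the Google-provided C# package (Google.Protobuf), the limit can be raised when constructing the input stream. This is a sketch under assumptions: BigMessage is a hypothetical generated message type, and CodedInputStream.CreateWithLimits is used to lift the default limits:

using System.IO;
using Google.Protobuf;

public static class LimitExample
{
    public static void Load()
    {
        using (var file = File.OpenRead("data.bin"))
        {
            // Raise the size limit (second argument); 100 is the default recursion limit.
            var input = CodedInputStream.CreateWithLimits(file, int.MaxValue, 100);
            var message = BigMessage.Parser.ParseFrom(input); // BigMessage: hypothetical generated type
        }
    }
}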

Does Protobuf reduce size?

In the compressed environment, the results were quite similar for both Protobuf and JSON: Protobuf messages were 9% smaller than JSON messages, and they took only 4% less time to be available to the JavaScript code.

What is Protobuf serialization?

Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data.
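As a minimal illustration using protobuf-net (the library discussed in this question), a round trip looks something like this; Person is a hypothetical example type:

using System.IO;
using ProtoBuf;

[ProtoContract]
public class Person   // hypothetical example type
{
    [ProtoMember(1)] public string Name { get; set; }
    [ProtoMember(2)] public int Age { get; set; }
}

public static class RoundTrip
{
    public static void Main()
    {
        var original = new Person { Name = "Ada", Age = 36 };
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, original);            // compact binary wire format
            ms.Position = 0;
            var copy = Serializer.Deserialize<Person>(ms); // reconstructs the object
        }
    }
}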

Is Protobuf faster than JSON?

TL;DR: encoding and decoding string-intensive data in JavaScript is faster with JSON than it is with protobuf. When you have structured data in JavaScript that needs to be sent over the network (to another microservice, for example) or saved to a storage system, it first needs to be serialized.


1 Answer

Looking at the new comments, this appears to be (as the OP notes) limited by MemoryStream capacity. A slight annoyance in the protobuf spec is that, since sub-message lengths are variable and must prefix the sub-message, it is often necessary to buffer portions until the length is known. This is fine for most reasonable graphs, but an exceptionally large graph can end up doing quite a bit of work in memory (the exception being the "root object has millions of direct children" scenario, which doesn't suffer from this).

If you aren't tied to a particular layout (perhaps due to .proto interop with an existing client), then a simple fix is as follows: on child (sub-object) properties (including lists / arrays of sub-objects), tell it to use "group" serialization. This is not the default layout, but it says "instead of using a length-prefix, use a start/end pair of tokens". The downside of this is that if your deserialization code doesn't know about a particular object, it takes longer to skip the field, as it can't just say "seek forwards 231413 bytes" - it instead has to walk the tokens to know when the object is finished. In most cases this isn't an issue at all, since your deserialization code fully expects that data.

To do this:

// Inside a [ProtoContract] type (names here match your own model):
[ProtoMember(1, DataFormat = DataFormat.Group)]
public SomeType SomeChild { get; set; }
....
[ProtoMember(4, DataFormat = DataFormat.Group)]
public List<SomeOtherType> SomeChildren { get { return someChildren; } }

The deserialization in protobuf-net is very forgiving by default (there is an optional strict mode), and it will happily deserialize groups in place of length-prefixes, and length-prefixes in place of groups (meaning: any data you have already stored somewhere should work fine).
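Putting it together, a fuller sketch (type and member names here are illustrative, following the fragment above):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class SomeType { [ProtoMember(1)] public int Value { get; set; } }

[ProtoContract]
public class SomeOtherType { [ProtoMember(1)] public string Name { get; set; } }

[ProtoContract]
public class Parent   // illustrative container type
{
    private readonly List<SomeOtherType> someChildren = new List<SomeOtherType>();

    // Group encoding writes start/end tokens instead of a length-prefix,
    // so the serializer does not need to buffer the child to measure it.
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public SomeType SomeChild { get; set; }

    [ProtoMember(4, DataFormat = DataFormat.Group)]
    public List<SomeOtherType> SomeChildren { get { return someChildren; } }
}

public static class Demo
{
    public static void Main()
    {
        var parent = new Parent { SomeChild = new SomeType { Value = 1 } };
        parent.SomeChildren.Add(new SomeOtherType { Name = "child" });

        using (var file = File.Create("graph.bin"))
        {
            Serializer.Serialize(file, parent); // no large in-memory buffering for groups
        }
    }
}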

answered Sep 23 '22 by Marc Gravell