
Is it best to have many fields in a protobuf message or nested messages?

I tried to find some recommendations on the web but could not find anything relevant.

Let's say that I am creating a protocol buffer message that will contain a lot of fields (50+). Is it best to keep all the fields at the same level or to organize them in sub-messages? Is there any performance impact either way?

Example:

message myMessage{
 string field1 = 1;
 string field2 = 2;
 ....
 string fieldn = n;
}

vs

message myMessage{
 SubMessage1 groupedfieldsbasedonsomebusinesslogic1 = 1;
 SubMessage2 groupedfieldsbasedonsomebusinesslogic2 = 2;

 message SubMessage1{
  string field1 = 1;
  string field2 = 2;
  ... 
  string fieldx = x;
 } 

 message SubMessage2{
  string fieldxplus1 = x+1;
  ... 
  string fieldn = n;
 }
}

I am not considering readability so much here, as there are pros and cons to having flat or nested data when deserializing. My question is really focused on the technical impacts.

Antoine Lefebvre, asked Apr 05 '18 10:04



1 Answer

There is no "best" - everything is contextual, and only you have most of the context.

However! Some minor thoughts on performance:

  • a nested approach requires more objects; usually this is fine, unless your volumes are huge
  • a nested approach may make it easier to understand the object model and the relationships between certain parts of the data
  • a flat approach requires larger field numbers; field numbers 1-15 take a single-byte header; field numbers 16-2047 require a 2-byte header (and so on); in reality this extra byte for a few fields is unlikely to hurt you much, and is offset by the overhead of the alternative (nested) approach:
  • a nested approach requires a length-prefix per sub-object, or a start/end token ("group" in the protocol); this isn't much in terms of extra size, but:
    • a length-prefix requires the serializer to know the length in advance, which means either double-processing (a "compute length" sweep), or buffering; in most cases this isn't a big issue, but it may be problematic for very large sub-graphs
    • start/end tokens are something that Google has been trying to kill, and are not well supported in all libraries (and IIRC they don't exist in "proto3" schemas); I still really like them though, in some cases :) protobuf-net (from the tags) supports the ability to encode arbitrary sub-data as groups, but it might be awkward if you need to go cross-platform later
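To make the header and length-prefix arithmetic above concrete, here is a minimal hand-rolled sketch of the wire encoding for string fields and embedded messages. It is not the protobuf library: the helper names (`encode_varint`, `tag`, `string_field`, `message_field`) and the toy data (50 short strings, split into 5 groups of 10 for the nested case) are assumptions made for illustration only.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


WIRE_TYPE_LEN = 2  # length-delimited: strings, bytes, embedded messages


def tag(field_number: int, wire_type: int = WIRE_TYPE_LEN) -> bytes:
    """Field header: varint of (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return tag(field_number) + encode_varint(len(payload)) + payload


def message_field(field_number: int, body: bytes) -> bytes:
    # The body must be fully encoded (or at least measured) before the
    # length prefix can be written: the "compute length" sweep or buffering.
    return tag(field_number) + encode_varint(len(body)) + body


# Header size by field number
print([len(tag(n)) for n in (1, 15, 16, 2047, 2048)])  # [1, 1, 2, 2, 3]

# Hypothetical toy data: 50 short string values
values = [f"v{i}" for i in range(1, 51)]

# Flat: one message, field numbers 1..50
flat = b"".join(string_field(i + 1, v) for i, v in enumerate(values))

# Nested: 5 sub-messages of 10 fields each, inner field numbers 1..10
nested = b"".join(
    message_field(
        group + 1,
        b"".join(
            string_field(j + 1, values[group * 10 + j]) for j in range(10)
        ),
    )
    for group in range(5)
)

print(len(flat), len(nested))  # 276 251
```

For this particular toy data the nested form is a little smaller, because every inner field number stays below 16 and each sub-message only costs two extra bytes. With different field counts or payload sizes the balance can go the other way, so treat the numbers as an illustration of the mechanism rather than a general result.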

Of all these things, the one I would focus on, if it were me, is the second one: how easy the object model is to understand.

Perhaps start with something that looks usable, and measure it for realistic data volumes; does it perform acceptably?
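One way to do that kind of measurement is a small harness that takes any "serialize" callable and reports the encoded size and the mean time per call. The sketch below is an assumption-laden stand-in: `measure` is a hypothetical helper, and `json.dumps` is used only so the example runs without generated protobuf classes. With real generated Python messages you would pass the message's `SerializeToString` method instead.

```python
import json
import time
from typing import Callable


def measure(serialize: Callable[[], bytes], repeats: int = 1000) -> tuple[int, float]:
    """Return (encoded size in bytes, mean seconds per call)."""
    encoded = serialize()  # warm-up call; also gives the encoded size
    start = time.perf_counter()
    for _ in range(repeats):
        serialize()
    elapsed = time.perf_counter() - start
    return len(encoded), elapsed / repeats


# Stand-in payloads (hypothetical): the same 50 values, flat vs grouped.
flat_sample = {f"field{i}": f"v{i}" for i in range(1, 51)}
nested_sample = {
    f"group{g + 1}": {
        f"field{i}": f"v{i}" for i in range(g * 10 + 1, g * 10 + 11)
    }
    for g in range(5)
}

for name, sample in (("flat", flat_sample), ("nested", nested_sample)):
    # JSON is only a placeholder serializer; swap in the real one to compare.
    size, seconds = measure(lambda s=sample: json.dumps(s).encode("utf-8"))
    print(f"{name}: {size} bytes, {seconds * 1e6:.1f} us per call")
```

The numbers only mean something when the payloads resemble your real data in field count, string lengths, and nesting depth.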

Marc Gravell, answered Dec 09 '22 18:12