
Comparison of streaming message implementations in protobuf

What are the trade-offs, advantages and disadvantages of each of these streaming implementations where multiple messages of the same type are encoded?

Are they different at all? What I want to achieve is to store a vector of boxes in a protobuf.

Impl 1:

package foo;

message Boxes
{ 
  message Box 
  { required int32 w = 1;
    required int32 h = 2;
  }

  repeated Box boxes = 1; 
}

Impl 2:

package foo;

message Box 
{ required int32 w = 1;
  required int32 h = 2;
}

message Boxes 
{ repeated Box boxes = 1; 
}

Impl 3: Stream multiple Box messages into the same file.

package foo;

message Box 
{ required int32 w = 1;
  required int32 h = 2;
}

sumodds asked May 09 '13 18:05



2 Answers

Marc Gravell's answer is certainly correct, but one point he missed is:

  • Options 1 & 2 (repeated field) serialise / deserialise all the boxes at once.
  • Option 3 (multiple messages in the file) serialises / deserialises box by box. If using Java, you can use delimited files (writeDelimitedTo / parseDelimitedFrom), which add a varint length at the start of each message.

Most of the time it will not matter whether you use a repeated field or multiple messages. But if there are millions / billions of boxes, memory will be an issue for options 1 and 2 (repeated), because the whole Boxes message is held in memory at once, and option 3 (multiple messages in the file) would be the best choice.
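As a rough illustration of the box-by-box approach, here is a minimal pure-Python sketch of a varint length-delimited file, in the same spirit as Java's delimited format. The helper names (encode_varint, read_varint, write_delimited, read_delimited) are made up for this example, and the payloads are hand-written Box bytes standing in for what a generated class would serialise.

```python
import io


def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream):
    """Read one varint; return None if the stream ends before it starts."""
    result, shift = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7


def write_delimited(stream, payload):
    """Write one serialised message preceded by its varint length."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_delimited(stream):
    """Yield one serialised message at a time, so only one is in memory."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated message")
        yield payload


# Hand-encoded Box{w=3, h=4} and Box{w=5, h=6} (assumption: in real code
# these bytes would come from the generated Box class).
buf = io.BytesIO()
for payload in (b"\x08\x03\x10\x04", b"\x08\x05\x10\x06"):
    write_delimited(buf, payload)

buf.seek(0)
boxes = list(read_delimited(buf))
```

Because read_delimited is a generator, a reader can process each Box and discard it before the next one is read, which is what keeps memory flat for very large files.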

So in summary:

  • If there are millions / billions of boxes, use option 3 (multiple messages in the file).
  • Otherwise use one of the repeated options (1 / 2), because it is simpler and supported across all Protocol Buffers versions.

Personally, I would like to see a "standard" multiple-message format.

Bruce Martin answered Sep 24 '22 23:09


1 & 2 only change where / how the types are declared. The work itself will be identical.

3 is more interesting: you can't just stream Box after Box after Box, because the root object in protobuf is not terminated (to allow concat === merge). If you just write Box messages back to back, then when you deserialize you will get exactly one Box, holding the last w and h that were written. You need to add a length-prefix. You could do that arbitrarily, but if you happen to choose to "varint"-encode the length, you're close to what the repeated gives you, except that the repeated also includes a field-header (field 1, wire type 2, so binary 1010 = decimal 10) before each "varint" length.
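To make the byte-level argument concrete, here is a minimal pure-Python sketch that hand-encodes the wire format instead of using the protobuf library. The helper names (encode_varint, decode_varint, encode_box, decode_box, encode_boxes) are made up for illustration, and it assumes non-negative w and h so that each value is a plain varint.

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at data[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_box(w, h):
    """Box: field 1 (w) and field 2 (h), both wire type 0 (varint)."""
    return b"\x08" + encode_varint(w) + b"\x10" + encode_varint(h)


def decode_box(data):
    """Decode Box bytes; a later occurrence of a field overwrites an earlier one."""
    fields = {}
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        value, pos = decode_varint(data, pos)
        fields[tag >> 3] = value  # field number is the tag without the wire type
    return fields


def encode_boxes(boxes):
    """Boxes: each element is header 0x0A (field 1, wire type 2) + length + Box."""
    out = bytearray()
    for w, h in boxes:
        payload = encode_box(w, h)
        out += b"\x0a" + encode_varint(len(payload)) + payload
    return bytes(out)


# Two Box messages written back to back decode as a single merged Box.
merged = decode_box(encode_box(3, 4) + encode_box(5, 6))

# The repeated field frames each Box with 0x0A and a varint length.
repeated = encode_boxes([(3, 4), (5, 6)])
```

Here merged comes out as {1: 5, 2: 6}: the first box is lost. The repeated bytes are 0a 04 08 03 10 04 0a 04 08 05 10 06, i.e. a varint length-prefixed stream with one extra 0x0A byte in front of each length.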

If I were you, I'd just use the repeated for simplicity. Which of 1 / 2 you choose is a matter of personal preference.

Marc Gravell answered Sep 24 '22 23:09