What are the trade-offs, advantages and disadvantages of each of these streaming implementations where multiple messages of the same type are encoded? Are they any different at all? What I want to achieve is to store a vector of boxes in a protobuf (see the sketch after Impl 3 below).
Impl 1:

package foo;

message Boxes
{
    message Box
    {
        required int32 w = 1;
        required int32 h = 2;
    }
    repeated Box boxes = 1;
}
Impl 2:

package foo;

message Box
{
    required int32 w = 1;
    required int32 h = 2;
}

message Boxes
{
    repeated Box boxes = 1;
}
Impl 3: stream multiple of these messages into the same file.

package foo;

message Box
{
    required int32 w = 1;
    required int32 h = 2;
}
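For concreteness, here is roughly what I mean for Impl 2, as a C++ sketch (it assumes the classes protoc generates from the foo.proto above; StoreBoxes and LoadBoxes are just illustrative names):

// Sketch: pack a std::vector of (w, h) pairs into the Impl 2 schema,
// write it to a file, and read it back.
#include <fstream>
#include <utility>
#include <vector>
#include "foo.pb.h"  // generated with: protoc --cpp_out=. foo.proto

// Copy the vector into the repeated field and serialize the whole message.
bool StoreBoxes(const std::vector<std::pair<int, int>>& input, const char* path) {
    foo::Boxes boxes;
    for (const auto& wh : input) {
        foo::Box* box = boxes.add_boxes();
        box->set_w(wh.first);
        box->set_h(wh.second);
    }
    std::ofstream out(path, std::ios::binary);
    return boxes.SerializeToOstream(&out);
}

// Parse the file back into one Boxes message and copy it out again.
bool LoadBoxes(const char* path, std::vector<std::pair<int, int>>* output) {
    foo::Boxes boxes;
    std::ifstream in(path, std::ios::binary);
    if (!boxes.ParseFromIstream(&in)) return false;
    for (const foo::Box& box : boxes.boxes()) {
        output->emplace_back(box.w(), box.h());
    }
    return true;
}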
Marc Gravell's answer is certainly correct, but one point he missed is memory. Most of the time it will not matter whether you use a repeated field or multiple messages, but if there are millions or billions of boxes, memory will be an issue for options 1 and 2 (repeated), and option 3 (multiple messages in the same file) would be the best choice, because it can be written and read one Box at a time (see the sketch below).
So in summary: personally, I would like to see a "standard" multiple-message format.
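For example, option 3 could look roughly like this in C++ (a sketch, assuming protoc-generated classes and a protobuf release recent enough to ship the delimited-message helpers in google/protobuf/util/delimited_message_util.h):

// Sketch of "Impl 3": one length-delimited Box record after another, so
// neither the writer nor the reader ever holds all boxes in memory at once.
#include <fstream>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/util/delimited_message_util.h>
#include "foo.pb.h"

void WriteBoxes(const char* path, int count) {
    std::ofstream out(path, std::ios::binary);
    for (int i = 0; i < count; ++i) {
        foo::Box box;
        box.set_w(i);
        box.set_h(i * 2);
        // Writes a varint length prefix followed by the message bytes.
        google::protobuf::util::SerializeDelimitedToOstream(box, &out);
    }
}

void ReadBoxes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    google::protobuf::io::IstreamInputStream raw(&in);
    foo::Box box;
    bool clean_eof = false;
    // Reads one record per iteration; memory use stays flat however many
    // boxes the file contains.
    while (google::protobuf::util::ParseDelimitedFromZeroCopyStream(&box, &raw, &clean_eof)) {
        // use box.w() and box.h() here
        box.Clear();  // not strictly needed for scalar fields, but safe when reusing
    }
}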
1 & 2 only change where / how the types are declared. The work itself will be identical.
3 is more interesting: you can't just stream Box after Box after Box, because the root object in protobuf is not terminated (to allow concat === merge). If you only write Box messages back to back, when you deserialize you will have exactly one Box, holding the last w and h that were written. You need to add a length prefix; you could do that arbitrarily, but if you happen to choose to "varint"-encode the length, you're close to what repeated gives you - except that repeated also includes a field header (field 1, wire type 2 - so binary 1010 = decimal 10) before each "varint" length.
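To see how close the two framings are, here is a rough sketch (assuming the generated C++ classes; the 0x0A literal is exactly that field-1 / wire-type-2 header, and FrameBox is just an illustrative name):

// Sketch: hand-frame each Box as tag 0x0A + varint length + payload. The
// concatenation of such records is itself a valid Boxes message.
#include <cstdint>
#include <string>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>
#include "foo.pb.h"

std::string FrameBox(const foo::Box& box) {  // expects w and h to be set
    const std::string payload = box.SerializeAsString();
    std::string framed;
    {
        google::protobuf::io::StringOutputStream raw(&framed);
        google::protobuf::io::CodedOutputStream out(&raw);
        out.WriteVarint32(0x0A);  // field header: (1 << 3) | 2
        out.WriteVarint32(static_cast<uint32_t>(payload.size()));  // varint length
        out.WriteString(payload);  // the Box bytes themselves
    }  // CodedOutputStream trims the buffer when it goes out of scope
    return framed;
}

// Two framed records concatenated parse as a Boxes message with two entries.
bool RoundTrips(const foo::Box& a, const foo::Box& b) {
    foo::Boxes parsed;
    return parsed.ParseFromString(FrameBox(a) + FrameBox(b)) &&
           parsed.boxes_size() == 2;
}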
If I were you, I'd just use repeated for simplicity. Which of 1 / 2 you choose comes down to personal preference.