
What are the best ways to use decimals and datetimes with protocol buffers?

I would like to find out the best way of storing some common data types that are not in the list of types supported by protocol buffers.

  • datetime (seconds precision)
  • datetime (milliseconds precision)
  • decimals with fixed precision
  • decimals with variable precision
  • lots of bool values (if you have lots of them, it looks like you'll have 1-2 bytes of overhead for each of them due to their tags)

The idea is also to map them easily to the corresponding C++/Python/Java data types.

sorin asked Sep 30 '09 09:09



4 Answers

The protobuf design rationale is most likely to keep data type support as "native" as possible, so that it's easy to support new languages in the future. I suppose they could provide built-in message types, but where do you draw the line?

My solution was to create two message types:

DateTime
TimeSpan

This is only because I come from a C# background, where these types are taken for granted.

In retrospect, TimeSpan and DateTime may have been overkill, but it was a "cheap" way of avoiding conversion from h/m/s to s and vice versa; that said, it would have been simple to just implement a utility function such as:

int TimeUtility::ToSeconds(int h, int m, int s) { return h * 3600 + m * 60 + s; }

Bklyn pointed out that heap memory is used for nested messages. In some cases this is a very valid concern, and we should always be aware of how memory is used. In other cases it matters less, because we're more worried about ease of implementation (this is the Java/C# philosophy, I suppose).

There's also a small disadvantage to using non-intrinsic types with the protobuf TextFormat::Printer; you cannot specify the format in which it is displayed, so it'll look something like:

my_datetime {
    seconds: 10
    minutes: 25
    hours: 12
}

... which is too verbose for some. That said, it would be harder to read if it were represented in seconds.

To conclude, I'd say:

  • If you're worried about memory/parsing efficiency, use seconds/milliseconds.
  • However, if ease of implementation is the objective, use nested messages (DateTime, etc).
Nick Bolton answered Sep 25 '22 07:09


Here are some ideas based on my experience with a wire protocol similar to Protocol Buffers.

datetime (seconds precision)

datetime (milliseconds precision)

I think the answer to these two is the same; with seconds precision you would typically just be dealing with a smaller range of numbers.

Use a sint64/sfixed64 to store the offset in seconds/milliseconds from some well-known epoch like midnight GMT 1/1/1970. This is how Date objects are internally represented in Java (as milliseconds since that epoch). I'm sure there are analogs in Python and C++.
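For the Python side, here is a minimal sketch of that mapping using only the standard library. The function names are my own (hypothetical), and it assumes you always work with timezone-aware datetimes and want whole milliseconds in the integer field:

```python
from datetime import datetime, timedelta, timezone

# The well-known epoch: midnight UTC, 1 January 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert an aware datetime to whole milliseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids float rounding errors.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)
```

Dates before 1970 come out negative, which is one reason to prefer a signed field type for the value.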

If you need time zone information, pass around your date/times in terms of UTC and model the pertinent time zone as a separate string field. For that, you can use the identifiers from the Olson Zoneinfo database since that has become somewhat standard.

This way you have a canonical representation for date/time, but you can also localize to whatever time zone is pertinent.

decimals with fixed precision

My first thought is to use a string similar to how one constructs Decimal objects from Python's decimal package. I suppose that could be inefficient relative to some numerical representation.
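A rough sketch of the string approach in Python, assuming the value travels in an ordinary protobuf string field. The helper names are hypothetical:

```python
from decimal import Decimal


def decimal_to_wire(value: Decimal) -> str:
    """Render a Decimal as text; trailing zeros are preserved."""
    return str(value)


def decimal_from_wire(text: str) -> Decimal:
    """Parse the text back into an exact Decimal (no binary float involved)."""
    return Decimal(text)
```

The round trip is exact, so Decimal("19.990") comes back with its trailing zero intact. The cost is a byte per digit on the wire.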

There may be better solutions depending on what domain you're working with. For example, if you're modeling a monetary value, maybe you can get away with using a uint32/64 to communicate the value in cents as opposed to fractional dollar amounts.
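As a sketch of the "value in cents" idea in Python: both sides agree on a fixed scale and only the scaled integer goes on the wire. The names and the scale of 2 are assumptions for illustration:

```python
from decimal import Decimal

SCALE = 2  # assumed number of fractional digits agreed by both sides (cents)


def to_minor_units(amount: Decimal, scale: int = SCALE) -> int:
    """Convert e.g. Decimal('12.34') to 1234; reject values that would lose digits."""
    scaled = amount.scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {scale} fractional digits")
    return int(scaled)


def from_minor_units(units: int, scale: int = SCALE) -> Decimal:
    """Convert the scaled integer back to a Decimal with the agreed scale."""
    return Decimal(units).scaleb(-scale)
```

If amounts can be negative (refunds, debits), a signed field such as sint64 would be needed instead of an unsigned one.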

There are also some useful suggestions in this thread.

decimals with variable precision

Doesn't Protocol Buffers already support this with float/double scalar types? Maybe I've misunderstood this bullet point.

Anyway, if you had a need to go around those scalar types, you can encode using IEEE-754 to uint32 or uint64 (float vs double respectively). For example, Java allows you to extract the IEEE-754 representation and vice versa from Float/Double objects. There are analogous mechanisms in C++/Python.
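In Python, the standard struct module can do that reinterpretation. A small sketch, with hypothetical function names:

```python
import struct


def double_to_bits(value: float) -> int:
    """Reinterpret a 64-bit IEEE-754 double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_double(bits: int) -> float:
    """Reinterpret an unsigned 64-bit integer as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def float_to_bits(value: float) -> int:
    """Same idea for 32-bit floats; the value is rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

Note that this only moves the bits around. The value is still binary floating point, so it does not give you exact decimal arithmetic.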

lots of bool values (if you have lots of them, it looks like you'll have 1-2 bytes of overhead for each of them due to their tags)

If you are concerned about wasted bytes on the wire, you could use bit-masking techniques to compress many booleans into a single uint32 or uint64.
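A minimal bit-masking sketch in Python. The function names are hypothetical, and it assumes both sides agree on the order of the flags and that there are at most 64 of them:

```python
from typing import List, Sequence


def pack_flags(flags: Sequence[bool]) -> int:
    """Pack up to 64 booleans into one integer; flag i becomes bit i."""
    if len(flags) > 64:
        raise ValueError("at most 64 flags fit in a uint64")
    mask = 0
    for position, flag in enumerate(flags):
        if flag:
            mask |= 1 << position
    return mask


def unpack_flags(mask: int, count: int) -> List[bool]:
    """Recover the first `count` booleans from the mask."""
    return [bool((mask >> position) & 1) for position in range(count)]
```

The trade-off is that the meaning of each bit now lives in your code rather than in the .proto file, so adding or reordering flags needs care.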

Because there isn't first-class support in Protocol Buffers, all of these techniques require a bit of a gentlemen's agreement between agents. Perhaps a naming convention on your fields, such as a "_dttm" or "_mask" suffix, would help communicate that a given field has additional encoding semantics above and beyond the default behavior of Protocol Buffers.

Joe Holloway answered Sep 25 '22 07:09


Sorry, not a complete answer, but a "me too".

I think this is a great question, one I'd love an answer to myself. The inability to natively describe fundamental types like datetimes and (for financial applications) fixed-point decimals, or to map them to language-specific or user-defined types, is a real killer for me. It has more or less prevented me from using the library, which I otherwise think is fantastic.

Declaring your own "DateTime" or "FixedPoint" message in the proto grammar isn't really a solution, because you'll still need to convert your platform's representation to/from the generated objects manually, which is error prone. Additionally, these nested messages get stored as pointers to heap-allocated objects in C++, which is wildly inefficient when the underlying type is basically just a 64-bit integer.

Specifically, I'd want to be able to write something like this in my proto files:

message Something {
    required fixed64 time = 1 [cpp_type="boost::posix_time::ptime"];
    required int64 price = 2 [cpp_type="fixed_point<int64_t, 4>"];
    ...
}

I would then be required to provide whatever glue was necessary to convert these types to/from fixed64 and int64 so that the serialization would work. Maybe through something like adobe::promote?

Bklyn answered Sep 22 '22 07:09


For datetime with millisecond resolution I used an int64 that has the datetime as YYYYMMDDHHMMSSmmm. This makes it both concise and readable, and surprisingly, will last a very long time.
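Here is a sketch of that layout in Python, using plain arithmetic so the digits line up as YYYYMMDDHHMMSSmmm. The function names are hypothetical, and the time zone is assumed to be agreed out of band since the number itself carries none:

```python
from datetime import datetime


def encode_readable(dt: datetime) -> int:
    """Pack a datetime into an integer that reads as YYYYMMDDHHMMSSmmm."""
    value = dt.year
    for part in (dt.month, dt.day, dt.hour, dt.minute, dt.second):
        value = value * 100 + part
    return value * 1000 + dt.microsecond // 1000


def decode_readable(value: int) -> datetime:
    """Unpack a YYYYMMDDHHMMSSmmm integer back into a naive datetime."""
    value, millis = divmod(value, 1000)
    value, second = divmod(value, 100)
    value, minute = divmod(value, 100)
    value, hour = divmod(value, 100)
    value, day = divmod(value, 100)
    year, month = divmod(value, 100)
    return datetime(year, month, day, hour, minute, second, millis * 1000)
```

Even the year 9999 fits comfortably in a signed 64-bit integer. The downside is that you cannot subtract two such values to get a duration; you have to decode them first.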

For decimals, I used byte[], knowing that there's no better representation that won't be lossy.
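The byte layout isn't specified above, so the following is only one possible scheme, not necessarily the one used here: the unscaled integer as big-endian two's-complement bytes plus a separate scale, similar in spirit to Java's BigDecimal. All names are hypothetical:

```python
from decimal import Decimal
from typing import Tuple


def decimal_to_bytes(value: Decimal) -> Tuple[bytes, int]:
    """Split a finite Decimal into (unscaled integer bytes, scale)."""
    sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):
        raise ValueError("NaN and infinity are not supported")
    unscaled = int("".join(map(str, digits)))
    if sign:
        unscaled = -unscaled
    # One extra bit leaves room for the two's-complement sign bit.
    length = (unscaled.bit_length() + 8) // 8
    return unscaled.to_bytes(length, "big", signed=True), -exponent


def decimal_from_bytes(data: bytes, scale: int) -> Decimal:
    """Rebuild the Decimal from the unscaled bytes and the scale."""
    unscaled = int.from_bytes(data, "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)
```

In a message this would be a bytes field for the digits and an integer field for the scale. The round trip is exact for any finite value that fits within the active decimal context's precision.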

ytoledano answered Sep 25 '22 07:09