I've recently read an article about protocol buffers:
Protocol Buffers is a method of serializing structured data. It is
useful in developing programs to communicate with each other over a
wire or for storing data. The method involves an interface description
language that describes the structure of some data and a program that
generates source code from that description for generating or parsing
a stream of bytes that represents the structured data.
What I want to know is where to use them. Are there any real-life examples beyond the simple address book example? Are they, for example, used to pre-cache query results from databases?
Protocol buffers are a data storage and exchange format, notably used for RPC (remote procedure calls), that is, communication between programs or computers.
Alternatives include language-specific serialization (Java serialization, Python pickles, etc.), tabular formats like CSV and TSV, structured text formats like XML and JSON, and other binary formats like Apache Thrift. Conceptually these are all just different ways of representing structured data, but in practice they have different pros and cons.
Protocol buffers are:
- Space efficient, relying on a custom format to represent data compactly.
- Strongly typed across languages (particularly useful in statically typed languages like Java, but still quite useful even in Python).
- Designed to be backwards and forwards-compatible. It's easy to make structural changes to protocol buffers (normally adding new fields or deprecating old ones) without needing to ensure all applications using the proto are updated simultaneously.
- Somewhat tedious to work with manually. While there is a text format, it is mostly useful for manually inspecting, not storing, protos. JSON, for instance, is much easier for a human to write and edit. Therefore protos are usually written and read by programs.
- Dependent on a `.proto` compiler. By separating the structure from the data, protocol buffers can be lean and mean, but it means without an associated `.proto` file and a tool like `protoc` to generate code to parse it, arbitrary data in proto format is unusable. This makes protos a poor choice for sending data to other people who may not have the `.proto` file on hand.
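To make the compactness, compatibility, and schema-dependence points more concrete, here is a minimal Python sketch of the varint part of the wire format. It is hand-rolled for illustration and is not the real protobuf library: the helper names and the fields `user_id` and `retries` are made up, and only wire type 0 (varint) is handled, whereas real messages also use fixed-width and length-delimited fields.

```python
import json
from typing import Dict, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Encode one unsigned integer field: key varint followed by value varint."""
    wire_type = 0  # 0 means "varint" in the protobuf wire format
    return encode_varint((field_number << 3) | wire_type) + encode_varint(value)


def decode_uint_fields(buf: bytes) -> Dict[int, int]:
    """Decode a buffer of varint fields into {field_number: value}."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(buf, pos)
        fields[field_number] = value
    return fields


def read_with_schema(buf: bytes, schema: Dict[int, str]) -> Dict[str, int]:
    """Attach names to the fields the reader knows; ignore the rest."""
    raw = decode_uint_fields(buf)
    return {name: raw[number] for number, name in schema.items() if number in raw}


# Hypothetical schemas: the "new" writer added field 2 after the "old" reader shipped.
OLD_SCHEMA = {1: "user_id"}
NEW_SCHEMA = {1: "user_id", 2: "retries"}

wire = encode_uint_field(1, 150) + encode_uint_field(2, 7)
as_json = json.dumps({"user_id": 150, "retries": 7}).encode()

print(wire.hex())                          # 0896011007
print(len(wire), len(as_json))             # binary is far shorter than the JSON text
print(decode_uint_fields(wire))            # without a schema: numbers only, no names
print(read_with_schema(wire, OLD_SCHEMA))  # old reader skips the unknown field 2
print(read_with_schema(wire, NEW_SCHEMA))  # new reader sees both fields
```

The names never travel with the data, which is where the space saving comes from, and also why a reader without the schema only sees field numbers and raw integers. It cannot even tell whether a varint was meant as an integer, a bool, or an enum.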
To make some sweeping generalizations about different formats:
- CSV/TSV/etc. are useful for human-constructed data that never needs to be transmitted between people or programs. They're easy to construct and easy to parse, but a nightmare to keep in sync, and they can't easily represent complex structures.
- Language-specific serialization like pickles can be useful for short-lived serialization, but it quickly runs into backwards compatibility issues and obviously limits you to one language. Except in some very specific cases, protobufs accomplish all the same goals with more safety and better future-proofing.
- JSON is ideal for sending data between different parties (e.g. public APIs). Because the structure and the content are transmitted together, anyone can understand it, and it's easy to parse in all major languages. There's little reason nowadays to use other structured formats like XML.
- Binary formats like Protocol Buffers are ideal for almost all other data serialization use cases: long- and short-term storage, inter-process communication, intra-process and application-wide caching, and more.
Google famously uses protocol buffers for practically everything they do. If you can imagine a reason to need to store or transmit data, Google probably does it with protocol buffers.
I used them to create a financial trading system. Here are the reasons:
- There are libraries for many languages. Some things needed to be in C++, others in C#, and it was open to being extended to Python, Java, etc.
- It needed to be compact and fast to serialize/deserialize, because of the speed requirements of a financial trading system. The messages were quite a lot shorter than comparable text-based messages, which meant you never had a problem fitting them in one network packet.
- It didn't need to be readable off the wire. Previously the system used XML, which is nice for debugging, but you can get debugging output in other ways and turn it off in production.
- It gives your message a natural structure, and an API for getting the parts you need. Writing something custom would have required thinking about all the helper functions to pull numbers out of the binary, with corner cases and all that.
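For contrast with that last point, this is roughly what the hand-written alternative looks like in Python with the standard `struct` module. The order message and its byte layout are hypothetical and only for illustration; they are not the format the trading system actually used.

```python
import struct

# Hypothetical fixed layout, little-endian with no padding:
#   order_id    uint64
#   price_ticks int64   (price as an integer number of ticks, avoiding floats)
#   quantity    uint32
#   side        uint8   (0 = buy, 1 = sell)
ORDER_LAYOUT = struct.Struct("<QqIB")
SIDES = {0: "buy", 1: "sell"}


def pack_order(order_id: int, price_ticks: int, quantity: int, side: int) -> bytes:
    """Serialize one order into the fixed 21-byte layout."""
    if side not in SIDES:
        raise ValueError("unknown side: %r" % side)
    return ORDER_LAYOUT.pack(order_id, price_ticks, quantity, side)


def unpack_order(buf: bytes) -> dict:
    """Parse one order, checking the corner cases by hand."""
    if len(buf) != ORDER_LAYOUT.size:
        raise ValueError(
            "expected %d bytes, got %d" % (ORDER_LAYOUT.size, len(buf))
        )
    order_id, price_ticks, quantity, side = ORDER_LAYOUT.unpack(buf)
    if side not in SIDES:
        raise ValueError("unknown side: %r" % side)
    return {
        "order_id": order_id,
        "price_ticks": price_ticks,
        "quantity": quantity,
        "side": SIDES[side],
    }


packet = pack_order(42, 1012500, 300, 1)
print(len(packet))           # 21
print(unpack_order(packet))
```

Every detail here (byte order, field widths, length checks, validation) has to be decided and kept identical in each language that touches the message, and adding a field changes the packet size and breaks every existing reader. Generated protobuf code takes care of that bookkeeping.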