Performance Metrics for Avro vs Protobuf

We are using Kafka for storing messages and are pushing an extremely large number of messages (more than 30k per minute). I'm not sure if it's relevant, but the code producing the Kafka messages is in JRuby.

Serialising and deserialising the messages also has a performance impact on the system.

Can someone help with comparing Avro vs Protocol Buffers in terms of serialisation and deserialisation speed?

asked Jul 03 '16 by Aditya Sanghi

People also ask

Is Protobuf faster than Avro?

According to JMH, Protobuf can serialize some data 4.7 million times per second, whereas Avro can only do 800k per second.

Which is better Avro or Protobuf?

Avro is the most compact, but Protobuf is just 4% bigger. Thrift is no longer an outlier for file size among the binary formats. All implementations of Protobuf produce similar sizes. XML is still the most verbose, so its file size is comparatively the biggest.

What is faster than Protobuf?

Cap'n Proto calls this “packing” the message; it achieves similar (better, even) message sizes to protobuf encoding, and it's still faster. When bandwidth really matters, you should apply general-purpose compression, like zlib or LZ4, regardless of your encoding format.

Is Avro faster than JSON?

Avro with the Snappy and Deflate codecs achieves a strong compression of 92%. Even though JSON-Bzip is slightly stronger, JSON-Gzip and Avro with Snappy are three times faster.


1 Answer

I hate to tell you this, but there is no simple answer to your question.

The performance of a serialization format depends on many factors. First of all, performance is a property of the implementation more than of the format itself. What you really want to know is how well the specific JRuby implementations of each format perform (or maybe the Java implementations, if you're just wrapping them). The answer may be wildly different from the answer in other languages, like C++.

Additionally, performance will vary depending on how you use the library. Many libraries' APIs offer a trade-off between the "easy, slow" way and the "fast, hard" way. When optimizing, you'll want to carefully study the documentation and look for example code from the libraries' authors to learn about how to squeeze out maximum performance.

Finally -- and most importantly -- performance is wildly different depending on the data you are working with. Different formats and implementations optimize for different kinds of data. For instance, string-heavy data is going to exercise very different code paths from number-heavy data. For every format -- even JSON and XML* -- it's always possible to find one use case where they perform better than all the others. Be wary of benchmarks coming from the libraries' authors as these will tend to emphasize use cases favorable to them.
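To make the data-dependence concrete, here's a small stdlib-only Ruby sketch that times JSON encoding of a string-heavy record versus a number-heavy one. The payloads are invented for illustration; the point is that per-record cost (and the ranking between formats) can shift with the shape of the data, so you'd measure every format against payloads that look like yours.

```ruby
require 'benchmark'
require 'json'

# Two illustrative payload shapes: lots of text vs. lots of numbers.
string_heavy = { 'names'  => Array.new(50) { |i| "user-#{i}-" + 'x' * 40 } }
number_heavy = { 'values' => Array.new(50) { |i| i * 3.14159 } }

{ 'string-heavy' => string_heavy, 'number-heavy' => number_heavy }.each do |label, record|
  blob = JSON.generate(record)
  t = Benchmark.realtime { 5_000.times { JSON.generate(record) } }
  puts format('%-12s  %.3fs for 5k encodes, %d bytes each', label, t, blob.bytesize)
end
```

Run the same loop with each candidate format's encoder and compare the shapes side by side rather than trusting a single aggregate number.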

Unfortunately, if you really want to know which format will perform better for you, the only way you're going to find out is by writing two versions of your code, one using each library, and comparing them. No external benchmark will be able to give you the real answer.
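As a sketch of what such a head-to-head comparison could look like in (J)Ruby, here's a minimal harness built on the standard library's Benchmark module. JSON and Marshal stand in for the real serializers purely so the example runs as-is; in your code you'd plug in the encode/decode calls from the avro and google-protobuf gems. The message fields and iteration count are illustrative.

```ruby
require 'benchmark'
require 'json'

# A sample record shaped loosely like a Kafka message payload (illustrative).
message = {
  'id'        => 12_345,
  'user'      => 'aditya',
  'timestamp' => 1_467_576_000,
  'payload'   => 'x' * 200
}

# Stand-ins for real serializers; swap in Avro/Protobuf encode/decode lambdas here.
serializers = {
  'json'    => { enc: ->(m) { JSON.generate(m) }, dec: ->(b) { JSON.parse(b) } },
  'marshal' => { enc: ->(m) { Marshal.dump(m) },  dec: ->(b) { Marshal.load(b) } }
}

iterations = 10_000
serializers.each do |name, s|
  encoded  = s[:enc].call(message)
  enc_time = Benchmark.realtime { iterations.times { s[:enc].call(message) } }
  dec_time = Benchmark.realtime { iterations.times { s[:dec].call(encoded) } }
  puts format('%-7s encode: %.3fs  decode: %.3fs  size: %d bytes',
              name, enc_time, dec_time, encoded.bytesize)
end
```

Running both versions against your real message distribution, on your real hardware, is the only benchmark whose numbers you can trust.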

(I'm the author of Protobuf v2 and Cap'n Proto, so I've spent a lot of time looking at serialization benchmarks and thinking about performance.)

* Just kidding about XML.

answered Oct 13 '22 by Kenton Varda