We are using Kafka for storing messages and pushing an extremely large number of messages (more than 30k per minute). I'm not sure if it's relevant, but the producer code that writes the Kafka messages is in JRuby.
Serialising and deserialising the messages also has a performance impact on the system. Can someone help compare Avro and Protocol Buffers in terms of serialisation and deserialisation speed?
According to one JMH benchmark, Protobuf can serialize a given payload about 4.7 million times per second, whereas Avro manages only about 800k per second.
Avro is the most compact, but Protobuf is only about 4% bigger. Thrift is no longer an outlier among the binary formats on file size, and all Protobuf implementations produce similar sizes. XML remains the most verbose, so its files are by far the largest.
Cap'n Proto calls this "packing" the message; it achieves message sizes similar to (or even better than) protobuf encoding, and it's still faster. When bandwidth really matters, you should apply general-purpose compression, such as zlib or LZ4, regardless of your encoding format.
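To illustrate the compression point, here is a minimal sketch in Ruby (the question's producer language) using only the stdlib Zlib module. The record fields are made up for illustration; in practice the input would be your Avro- or protobuf-encoded bytes rather than JSON:

```ruby
require 'zlib'
require 'json'

# A repetitive batch of records standing in for a serialized message batch
# (field names here are illustrative, not from any real schema).
records = (1..200).map do |i|
  { "user_id" => i, "event" => "page_view", "path" => "/products/#{i % 10}" }
end
message = JSON.generate(records)

# General-purpose compression applied on top of the encoding format.
compressed = Zlib::Deflate.deflate(message)
restored   = Zlib::Inflate.inflate(compressed)

puts "raw: #{message.bytesize} bytes, deflated: #{compressed.bytesize} bytes"
```

Repetitive batches like this compress very well; tiny individual messages may not, because the zlib framing adds a few bytes of overhead, which is one reason Kafka producers typically compress whole batches rather than single records.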
Avro with the Snappy and Deflate codecs achieves strong compression, around 92%. Even though JSON-Bzip compresses slightly better, JSON-Gzip and Avro with Snappy are about three times faster.
I hate to tell you this, but there is no simple answer to your question.
The performance of a serialization format depends on many factors. First of all, performance is a property of the implementation more than of the format itself. What you really want to know is how well the specific JRuby implementations of each format perform (or maybe the Java implementations, if you're just wrapping them). The answer may be wildly different from the answer in other languages, like C++.
Additionally, performance will vary depending on how you use the library. Many libraries' APIs offer a trade-off between the "easy, slow" way and the "fast, hard" way. When optimizing, you'll want to carefully study the documentation and look for example code from the libraries' authors to learn about how to squeeze out maximum performance.
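To make the "easy, slow" vs. "fast, hard" trade-off concrete, here is a stdlib-only Ruby sketch. The generic text encoder walks an arbitrary hash at runtime; the hand-maintained fixed binary layout (an illustrative choice, not any library's wire format) is far smaller and cheaper but must be kept in sync with the record shape by hand:

```ruby
require 'json'

event = { "user_id" => 12345, "amount_cents" => 9900, "flags" => 7 }

# Easy, slow: a generic serializer inspects the hash and emits text.
easy = JSON.generate(event)

# Fast, hard: a hand-rolled fixed layout for this one record type --
# three 32-bit little-endian signed integers (illustrative layout).
hard = [event["user_id"], event["amount_cents"], event["flags"]].pack("l<3")

decoded = hard.unpack("l<3")  # => [12345, 9900, 7]
puts "text: #{easy.bytesize} bytes, binary: #{hard.bytesize} bytes"
```

Real libraries expose this same trade-off in gentler forms, e.g. reusing encoder objects and output buffers across messages instead of allocating fresh ones per call.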
Finally -- and most importantly -- performance is wildly different depending on the data you are working with. Different formats and implementations optimize for different kinds of data. For instance, string-heavy data is going to exercise very different code paths from number-heavy data. For every format -- even JSON and XML* -- it's always possible to find one use case where they perform better than all the others. Be wary of benchmarks coming from the libraries' authors as these will tend to emphasize use cases favorable to them.
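One way to see the data-dependence yourself is to benchmark the same codec on differently shaped payloads. This sketch uses Ruby's stdlib Benchmark and JSON only because they need no gems; you would substitute your actual Avro and protobuf encode calls, and the payload shapes here are invented for illustration:

```ruby
require 'json'
require 'benchmark'

# Two payload shapes that exercise different code paths in a serializer.
string_heavy = { "title" => "lorem ipsum " * 50, "tags" => %w[a b c d e] * 20 }
number_heavy = { "samples" => (1..500).map { |i| i * 0.5 }, "count" => 500 }

timings = {}
[["string-heavy", string_heavy], ["number-heavy", number_heavy]].each do |name, payload|
  timings[name] = Benchmark.realtime { 2_000.times { JSON.generate(payload) } }
  puts format("%-12s %.3fs for 2000 encodes", name, timings[name])
end
```

The relative ranking of two formats can flip between these two shapes, which is exactly why a benchmark on someone else's data says little about yours.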
Unfortunately, if you really want to know which format will perform better for you, the only way you're going to find out is by writing two versions of your code, one using each library, and comparing them. No external benchmark will be able to give you the real answer.
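A head-to-head comparison like that can share one small harness. Below is a hedged sketch: `round_trips_per_sec` is a made-up helper name, and JSON and Marshal (both stdlib) stand in for the Avro and protobuf codecs you would actually plug in as encode/decode lambdas:

```ruby
require 'json'
require 'benchmark'

# Measure round-trip (encode + decode) throughput for any codec
# supplied as an encode lambda and a decode lambda.
def round_trips_per_sec(encode, decode, sample, n = 5_000)
  elapsed = Benchmark.realtime do
    n.times { decode.call(encode.call(sample)) }
  end
  (n / elapsed).round
end

sample = { "user_id" => 1, "event" => "click", "ts" => 1_700_000_000 }

codecs = {
  "json"    => [->(o) { JSON.generate(o) }, ->(s) { JSON.parse(s) }],
  "marshal" => [->(o) { Marshal.dump(o) },  ->(s) { Marshal.load(s) }],
}

results = codecs.to_h do |name, (enc, dec)|
  [name, round_trips_per_sec(enc, dec, sample)]
end
results.each { |name, rps| puts "#{name}: #{rps} round-trips/sec" }
```

Run it on a sample of your real Kafka payloads, not synthetic data, since (as noted above) the data shape dominates the result.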
(I'm the author of Protobuf v2 and Cap'n Proto, so I've spent a lot of time looking at serialization benchmarks and thinking about performance.)
* Just kidding about XML.