
Does binary encoding of AVRO compress data?

Tags:

avro

In one of our projects we use Kafka with Avro to transfer data between applications. Data is added to an Avro object, and the object is binary-encoded before being written to Kafka. We use binary encoding because it is generally described as a minimal representation compared to other formats.

The data is usually a JSON string, and when it is saved to a file it takes up to 10 MB of disk space. However, when the file is compressed (.zip), it takes only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing to a Kafka topic.

When I measure the length of the binary-encoded message (i.e. the length of the byte array), it is proportional to the length of the data string. So I assume binary encoding does not reduce the size at all.

Could someone tell me if binary encoding compresses data? If not, how can I apply compression?

Thanks!

Pal asked Nov 03 '14 09:11


2 Answers

Does binary encoding compress data?

Yes and no; it depends on your data.

Yes, in the sense that Avro does not repeat field names for every record. According to the Avro binary encoding specification, the encoded data carries only the values; the schema is stored once per .avro object container file, no matter how many records the file holds, so you save the space that JSON spends on repeating key names. Avro serialization also compresses a little by storing int and long values with variable-length zig-zag coding (which only saves space for small values). Beyond that, Avro does not "compress" data.

No, because in some extreme cases Avro-serialized data can be bigger than the raw data, e.g. a .avro file containing a single record with only one string field. The schema overhead can outweigh the savings from not storing the key name.
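To make the "only for small values" point concrete, here is a minimal sketch of zig-zag variable-length coding for a long, written with the Java standard library only. The class and method names (ZigZagVarint, zigZag, encode) are made up for illustration and are not part of the Avro API; Avro's own encoder does this internally.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Avro.
public class ZigZagVarint {

    // Zig-zag maps signed values to unsigned ones so that small magnitudes
    // stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    public static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Emits 7 bits per byte, lowest group first; a set high bit means
    // "more bytes follow".
    public static byte[] encode(long n) {
        long z = zigZag(n);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 1L, 64L, 1_000_000L, Long.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Small values fit in one or two bytes instead of a fixed eight, while a value near Long.MAX_VALUE needs ten bytes, which is more than the fixed-width form. Strings and byte arrays get no such treatment, which is why a message dominated by a long JSON string stays roughly the same size.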

If not, how can I apply compression?

According to the Avro codecs documentation, Avro has a built-in compression codec (deflate) and optional ones. Just add one line when writing object container files, on your DataFileWriter instance (here called dataFileWriter) and before calling create():

dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // deflate, compression level 6

or

dataFileWriter.setCodec(CodecFactory.snappyCodec()); // snappy codec

To use snappy you need to add the snappy-java library to your dependencies.
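The codecs above apply to object container files. For a single binary-encoded message that goes to Kafka as a byte array, one option is to compress the bytes yourself before sending. The following is a minimal sketch using java.util.zip from the standard library; the class name PayloadCompression and its methods are hypothetical, and the consumer would need to call the matching decompress step.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Avro or Kafka.
public class PayloadCompression {

    // Deflate the payload at level 6 (same level as the deflateCodec example).
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(6);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inverse of compress(); rejects truncated or corrupt input.
    public static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed payload", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive JSON-like text standing in for the real payload.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("{\"id\":").append(i).append(",\"status\":\"ok\"}");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

How much this saves depends entirely on how repetitive the payload is, so it is worth measuring on real messages before committing to it.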

zhaown answered Sep 16 '22 15:09


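If you plan to store your data in Kafka, consider using the Kafka producer's built-in compression support. Set it in the Properties object (here called props) that you pass to the producer:

props.put("compression.type", "snappy"); // the property is named "compression.codec" in the old 0.8 Scala producer

Compression is completely transparent on the consumer side: all consumed messages are decompressed automatically.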

Xuan Huy Pham answered Sep 19 '22 15:09