Spark with Avro, Kryo and Parquet

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't all be doing the same thing.

Parquet describes itself as a columnar storage format, and I kind of get that, but when I'm saving a Parquet file can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job, i.e. for sending objects over the network during a shuffle or spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?
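For reference, this is roughly the kind of setup I mean when I talk about Kryo during the job (the Event class and app name are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type shuffled around during the job
case class Event(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .setMaster("local[*]")
  // Kryo replaces Java serialization for Spark's internal traffic
  // (shuffles, caching) -- it says nothing about the on-disk file format
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small numeric id
  // instead of the full class name with each record
  .registerKryoClasses(Array(classOf[Event]))

val sc = new SparkContext(conf)
```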

asked Jun 14 '15 by Matthew Jones

1 Answer

Parquet works very well when you need to read only a few columns when querying your data. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better/faster.
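To make the column-pruning point concrete, here is a minimal sketch (assuming Spark 2.x+ with a SparkSession; the path and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-pruning")
  .master("local[*]")
  .getOrCreate()

// Only the "userId" and "amount" columns are read from disk;
// all other columns in the Parquet file are skipped entirely.
spark.read.parquet("/data/events.parquet")
  .select("userId", "amount")
  .show()
```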

Another limitation of Parquet is that it is essentially a write-once format. So usually you need to collect data in some staging area and write it out to a Parquet file once a day (for example).

This is where you might want to use Avro. E.g. you can collect Avro-encoded records in a Kafka topic or in local files and run a batch job that converts all of them to a Parquet file at the end of the day. This is fairly easy to implement thanks to the parquet-avro library, which provides tools for converting between the Avro and Parquet formats automatically.
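A rough sketch of that conversion step using parquet-avro's AvroParquetWriter (the schema, path and field values here are hypothetical):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// A made-up two-field schema for illustration
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |{"name":"id","type":"long"},{"name":"name","type":"string"}]}""".stripMargin)

// parquet-avro translates the Avro schema into a Parquet schema for you
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("/staging/events.parquet"))
  .withSchema(schema)
  .build()

// In a real batch job these records would come from Kafka or staged files
val record = new GenericData.Record(schema)
record.put("id", 1L)
record.put("name", "click")
writer.write(record)
writer.close()
```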

And of course you can use Avro outside of Spark/Big Data. It is a fairly good serialization format, similar to Google Protobuf or Apache Thrift.
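For example, a standalone round-trip through Avro's compact binary format using the generic API (the one-field schema and value are made up):

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val schema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")

val user = new GenericData.Record(schema)
user.put("name", "Ada")

// Encode the record to Avro binary
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(user, encoder)
encoder.flush()

// Decode it back using the same schema
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val back = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
```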

answered Sep 19 '22 by kostya