Now that Spark 2.4 has built-in support for Avro format, I'm considering changing the format of some of the data sets in my data lake - those that are usually queried/joined for entire rows rather than specific column aggregations - from Parquet to Avro.
However, most of the work on top of the data is done via Spark, and to my understanding, Spark's in-memory caching and computations are done on columnar-formatted data. Does Parquet offer a performance boost in this regard, while Avro would incur some sort of data "transformation" penalty? What other considerations should I be aware of in this regard?
Both formats shine under different constraints but have things like strong types with schemas and a binary encoding in common. In its basic form the differentiation boils down to this:

- Parquet is a column-oriented format: values of each column are stored together, so queries that read or aggregate only a few columns are very efficient, but files are best written in larger batches.
- Avro is a row-oriented format: each record is written and read as a whole, which makes it well suited to streaming ingestion and to workloads that consume entire rows.
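To make that difference concrete, here is a minimal sketch (spark-shell style) of the same aggregation against both formats; the paths and column names are hypothetical, and for Spark 2.4 the Avro data source must be on the classpath, e.g. via `--packages org.apache.spark:spark-avro_2.11:2.4.0`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-vs-avro")
  .getOrCreate()

// Parquet is columnar: this aggregation only needs to read the two
// referenced columns from disk.
spark.read.parquet("/data/events_parquet")
  .groupBy("user_id").sum("amount")
  .show()

// Avro is row-oriented: the same query deserializes whole records
// before Spark projects out the two columns it needs.
spark.read.format("avro").load("/data/events_avro")
  .groupBy("user_id").sum("amount")
  .show()
```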
As you already have your data and ingestion process tuned to write Parquet files, it's probably best to stay with Parquet as long as ingestion latency does not become a problem for you.
A typical usage is actually a mix of Parquet and Avro. Recent, freshly arrived data is stored as Avro files, as this makes it immediately available to the data lake. Older data is transformed, e.g. on a daily basis, into Parquet files, as they are smaller and more efficient to read but can only be written in batches. When working with this data, you load both into Spark as a union of two tables, so you get the efficient reads of Parquet combined with the immediate availability of Avro. This pattern is often hidden behind table formats such as Uber's Hudi or Apache Iceberg (incubating), which was started at Netflix.
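A rough sketch of that union, assuming both data sets share a schema and live under hypothetical /lake/events/... paths (again spark-shell style, with the spark-avro module on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-plus-parquet")
  .getOrCreate()

// Historical data, compacted into Parquet by a daily batch job.
val historical = spark.read.parquet("/lake/events/parquet")

// Freshly ingested data, landed as Avro so it is queryable right away.
val recent = spark.read.format("avro").load("/lake/events/avro")

// Expose both as one logical table; unionByName matches columns by
// name in case the two writers ordered them differently.
historical.unionByName(recent).createOrReplaceTempView("events")

spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id").show()
```

The daily compaction step is then just a batch job that reads the previous day's Avro files and rewrites them into the Parquet area with `df.write.parquet(...)`.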