Parquet vs Delta format in Azure Data Lake Gen 2 store

I am importing fact and dimension tables from SQL Server to Azure Data Lake Gen 2.

Should I save the data as Parquet or Delta if I am going to wrangle the tables to create a dataset for running ML models on Azure Databricks?

What is the difference between storing as Parquet and storing as Delta?

asked Dec 16 '20 by learner

People also ask

What is the difference between Delta and Parquet?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Apache Parquet is a free and open-source column-oriented data storage format.

What is Parquet files in Azure data lake?

Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON. For further information, see Parquet Files.

Are Delta tables faster than Parquet?

Using several techniques, Delta boasts query performance 10 to 100 times faster than Apache Spark on plain Parquet.

What is the difference between Azure data lake and Delta Lake?

A data lake by itself does not provide ACID transactions, whereas Delta Lake, mostly built through Databricks, does provide ACID transactions. Using Synapse is another way to overcome this limitation.




2 Answers

Delta stores the data as Parquet and just adds a layer on top of it with advanced features: a transaction log that provides a history of events, and more flexibility to change the content through update, delete and merge capabilities. The Delta documentation explains quite well how the files are organized.
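To make the "Parquet plus a transaction log" idea concrete, here is a simplified, illustrative sketch in plain Python (it does not use the real Delta writer; the file and table names are made up). A Delta table is just a directory of ordinary Parquet data files plus a `_delta_log/` folder of numbered JSON commit files that readers replay to find the current set of live files:

```python
# Illustrative sketch only -- not the real Delta Lake writer.
import json
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp()) / "events"   # hypothetical table directory
(table / "_delta_log").mkdir(parents=True)

# Data lives in ordinary Parquet files (empty placeholder here).
(table / "part-00000.snappy.parquet").touch()

# Each commit is a numbered JSON file in _delta_log/ describing actions
# such as "add" (a data file joined the table) or "remove".
commit = {"add": {"path": "part-00000.snappy.parquet", "dataChange": True}}
log_file = table / "_delta_log" / "00000000000000000000.json"
log_file.write_text(json.dumps(commit))

# Readers replay the log to learn which Parquet files make up the
# current version of the table -- this is what enables time travel.
actions = [json.loads(line) for line in log_file.read_text().splitlines()]
live_files = [a["add"]["path"] for a in actions if "add" in a]
print(live_files)  # ['part-00000.snappy.parquet']
```

A plain Parquet directory has only the data files; the `_delta_log/` folder is what adds versioning and ACID commits on top.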

One drawback is that it can get very fragmented under lots of updates, which could hurt performance. As Azure Data Lake Storage Gen2 is anyway not optimized for large IO, this is not really a big problem, but some optimizations of the Parquet format will be less effective this way.
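On Databricks the fragmentation from many small update commits can be mitigated by periodically compacting the table. Assuming a hypothetical Delta table registered as `my_table`, the maintenance commands look like:

```sql
-- Compact many small files into fewer large ones (bin-packing)
OPTIMIZE my_table;

-- Clean up data files no longer referenced by the transaction log
-- (subject to the default retention period)
VACUUM my_table;
```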

I would use Delta, just for the advanced features. It is very handy in scenarios where the data is updated over time, not just appended. An especially nice feature is that you can read Delta tables as of a given point in time at which they existed.

SQL `AS OF` syntax
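In Spark SQL that time-travel syntax looks like this (assuming a hypothetical Delta table named `events`):

```sql
-- Read the table as it existed at a given timestamp
SELECT * FROM events TIMESTAMP AS OF '2020-12-16';

-- Or pin the read to a specific commit version
SELECT * FROM events VERSION AS OF 3;
```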

This is useful for having consistent training sets (you always get the same training dataset without having to separate it out into individual Parquet files). For the ML models, handling the Delta format as input may be problematic, as likely only a few frameworks will be able to read it directly, so you will need to convert it during some pre-processing step.

answered Oct 05 '22 by attish


Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.

Reference : https://docs.microsoft.com/en-us/azure/databricks/delta/delta-faq

answered Oct 05 '22 by RaHuL VeNuGoPaL