
Apache Spark + Delta Lake concepts

I have several questions about Spark + Delta Lake.

1) Databricks proposes three layers (bronze, silver, gold). Which layer is recommended for machine learning, and why? I assume the idea is to have the data clean and ready in the gold layer.

2) If we abstract these three layers, can we think of the bronze layer as a data lake, the silver layer as databases, and the gold layer as a data warehouse, at least in terms of functionality?

3) Is Delta architecture a commercial term, an evolution of the Kappa architecture, or a new architecture in the same vein as Lambda and Kappa? What are the differences between Delta + Lambda architecture and Kappa architecture?

4) In many cases Delta + Spark scales much further than most databases, usually at a much lower cost, and with the right tuning query results can come back almost 2x faster. I know it is complicated to compare today's popular data warehouses with a Feature/Agg Data Store, but how can I make this comparison?

5) I used to use Kafka, Kinesis, or Event Hub for stream processing. What kinds of problems can arise if we replace these tools with a Delta Lake table? (I know it depends on many things, but I would like a general picture.)

Eric Gabriel Bellet Locker asked May 19 '19 19:05



2 Answers

1) Leave it up to your data scientists. They should be comfortable working in the silver and gold layers; some more advanced data scientists will want to go back to the raw data and parse out additional information that may not have been included in the silver/gold tables.

2) Bronze = raw data in its native format or in Delta Lake format. Silver = sanitized and cleaned data in Delta Lake. Gold = data that is accessed via Delta Lake or pushed to a data warehouse, depending on business requirements.

3) Delta architecture is a simpler version of the Lambda architecture. It is a commercial term at this point; we'll see if that changes in the future.

4) Delta Lake + Spark is the most scalable data storage mechanism with a reasonable price. You're welcome to test the performance based on your business requirements. Delta Lake will be far cheaper than any data warehouse for storage. Your requirements around data access and latency will be the bigger question.

5) Kafka, Kinesis, or Event Hub are sources for getting data from the edge to the data lake. Delta Lake can act as both a source and a sink for a streaming application. There are actually very few problems with using Delta as a source. The Delta Lake source lives on blob storage, so we get around many infrastructure issues but take on the consistency issues of blob storage. Delta Lake as a source for streaming jobs is far more scalable than Kafka/Kinesis/Event Hub, but you still need those tools to get data from the edge into the Delta Lake.
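
As a rough illustration of point 5, the sketch below shows a Delta table used as both a streaming source and a streaming sink through Spark Structured Streaming. It assumes `spark` is an existing SparkSession with the Delta Lake package configured; the function name and all paths are hypothetical placeholders, not something taken from the answer.

```python
def stream_delta_to_delta(spark, source_path, sink_path, checkpoint_path):
    """Continuously copy newly committed rows from one Delta table to another.

    Assumptions: `spark` is a SparkSession with Delta Lake available, and the
    three path arguments are placeholders for locations on your own storage.
    """
    # Delta table as a streaming source: new commits arrive as micro-batches.
    events = spark.readStream.format("delta").load(source_path)

    # Delta table as a streaming sink: the checkpoint location records
    # progress so a restarted query can resume where it left off.
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )
```

Called with a live session, for example `stream_delta_to_delta(spark, "/delta/bronze/events", "/delta/silver/events", "/delta/_checkpoints/events")`, this would start a query that keeps running until it is stopped. Getting data from the edge into the source table would still be the job of Kafka, Kinesis, or Event Hub.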

Joe Widen answered Oct 19 '22 05:10


  1. The medallion tables are a recommendation based on how our customers are using Delta Lake. You do not have to follow it exactly; however, it does align nicely with how people design EDWs. As for which table to use for machine learning, that is going to be a choice for the people doing the machine learning. Some may want to access the Bronze tables because that is the raw data; nothing has been done to it. Others may want the Silver tables because they are presumed to be clean, albeit augmented. Usually the Gold tables are highly refined and specific to answering well-defined business questions.

  2. Not exactly. The Bronze tables are the raw event data, e.g. one row per event or measurement. The Silver tables are also at the event/measurement level, but they are highly refined and ready for queries, reporting, dashboards, etc. The Gold tables can be fact and dimension tables, aggregate tables, or curated data sets. It is important to remember that Delta is not meant to be used as a transactional OLTP system. It is really meant for OLAP workloads.

  3. Delta architecture is the name we gave to a particular implementation of Delta Lake. It is not a commercial term per se, but hopefully it becomes one. There is enough information out there to compare and contrast the Kappa and Lambda architectures. The Delta architecture is well defined throughout the Delta documentation and Databricks blogs, tech talks, YouTube videos, etc.

  4. I would ask exactly what it is you want to compare: speed, features, products, ...?

  5. Delta Lake is not trying to replace any messaging pub/sub systems; they have different use cases. Delta Lake can connect to each of the products you mention as both a subscriber and a publisher. Don't forget that Delta Lake is an open storage layer that brings ACID-compliant transactions, high performance, and high reliability to data lakes.
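
To make the layering in points 1 and 2 concrete, here is a minimal plain-Python sketch. It deliberately uses in-memory lists instead of Delta tables so that it runs without Spark; the sensor data, the field names, and the `to_silver`/`to_gold` helpers are all made up for illustration.

```python
from collections import defaultdict

# Bronze: raw events exactly as they arrived (hypothetical sensor readings).
bronze = [
    {"device": "a", "temp_c": "21.5"},
    {"device": "a", "temp_c": "bad"},  # malformed reading, kept as-is in bronze
    {"device": "b", "temp_c": "19.0"},
    {"device": "a", "temp_c": "22.5"},
]


def to_silver(rows):
    """Stay at one row per event, but parse and validate each field."""
    clean = []
    for row in rows:
        try:
            clean.append({"device": row["device"], "temp_c": float(row["temp_c"])})
        except ValueError:
            continue  # drop rows whose temperature cannot be parsed
    return clean


def to_gold(rows):
    """Aggregate for one business question: mean temperature per device."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["device"]].append(row["temp_c"])
    return {device: sum(vals) / len(vals) for device, vals in readings.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'a': 22.0, 'b': 19.0}
```

Someone doing machine learning could start from any of the three: `bronze` still contains the malformed row, `silver` is cleaned but keeps event-level detail, and `gold` only answers the one aggregated question.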

Louis.

Big Lou answered Oct 19 '22 04:10