
How to choose between Azure Data Lake Analytics and Azure Databricks

Azure Data Lake Analytics and Azure Databricks can both be used for batch processing. Could anyone please help me understand when to choose one over the other?

Pragmatic asked May 22 '18 11:05




2 Answers

In my humble opinion, a lot of it comes down to existing skill sets. If you have a team experienced in Spark, Java, Python, R or Scala, then Databricks is a natural fit. If, on the other hand, you have a team with existing SQL and C# skills, then the learning curve for them with U-SQL will be less steep.

That aside, there are other questions which can drive out differences:

  • Do you require real-time interaction (Databricks) or batch-mode analytics (both)? There is a feedback item requesting real-time interactivity for U-SQL; please vote for it.
  • Do you want a pay-as-you-go model (U-SQL) or clusters that auto-terminate after a certain period (Databricks)?
  • Do you like working in a notebook (Databricks) or with Visual Studio / VS Code / PowerShell / the .NET SDK (U-SQL)?
  • Do you want to use Spark libraries like GraphX (Databricks)?
  • Do you want the ability to run and scale any runtime (U-SQL)? See here for more details.
  • Do you want a local development emulator (U-SQL)? The U-SQL emulator in Visual Studio is seamless, i.e. you develop your code against your local drives in the same structure as your lake (for free), then simply click the drop-down in Visual Studio to run in the cloud. Although I think you can have a local Spark environment, I'm not sure what the local (and disconnected) development experience is for Databricks.
  • Are you using ADLS Gen 2 (only Databricks)? See here.
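
If it helps to see the questions above as a checklist, here is a minimal Python sketch. The names (`Requirements`, `suggest`) are made up for illustration, and the simple tally, along with treating ADLS Gen 2 as a hard constraint, is my own simplification of the list rather than any official decision rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Requirements:
    """Hypothetical checklist mirroring the questions in the list above."""
    needs_interactive: bool = False      # real-time interaction, not only batch
    prefers_pay_per_job: bool = False    # pay-as-you-go instead of clusters
    prefers_notebooks: bool = False      # notebook rather than Visual Studio tooling
    needs_spark_libraries: bool = False  # e.g. GraphX
    needs_local_emulator: bool = False   # local, disconnected development
    uses_adls_gen2: bool = False         # storage is ADLS Gen 2


def suggest(req: Requirements) -> Tuple[str, List[str]]:
    """Return a suggested service and the reasons that led to it."""
    # Assumption: ADLS Gen 2 is treated as a hard constraint, because the
    # list marks it as "only Databricks".
    if req.uses_adls_gen2:
        return "Databricks", ["ADLS Gen 2 storage"]

    databricks: List[str] = []
    usql: List[str] = []
    if req.needs_interactive:
        databricks.append("real-time interaction")
    if req.prefers_notebooks:
        databricks.append("notebook workflow")
    if req.needs_spark_libraries:
        databricks.append("Spark libraries")
    if req.prefers_pay_per_job:
        usql.append("pay-as-you-go model")
    if req.needs_local_emulator:
        usql.append("local development emulator")

    # Assumption: every criterion has equal weight; a real decision would
    # weigh them against team skills and budget.
    if len(databricks) > len(usql):
        return "Databricks", databricks
    if len(usql) > len(databricks):
        return "U-SQL", usql
    return "either", databricks + usql


if __name__ == "__main__":
    choice, reasons = suggest(
        Requirements(prefers_pay_per_job=True, needs_local_emulator=True)
    )
    print(choice, reasons)
```

A tie (including an empty checklist) comes back as "either", which is where the skill-set consideration would be the tie-breaker.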

UPDATE October 2018: As far as I am aware, U-SQL does not currently support ADLS Gen 2, which would count against it (happy to be corrected). I will update the post if and when that support is added.

UPDATE January 2019: U-SQL has not had any meaningful updates since Spring 2018.

HTH

wBob answered Oct 01 '22 22:10


Databricks offers more language options, which allows professionals with different skills to work on the data. With Databricks you can also run jobs on high-performance, in-memory clusters.

In one project, we use the data lake mainly as storage and run all the jobs (ETL, analytics) in Databricks notebooks. Storing data in the data lake is cheaper.

Back to your question: if you have complex batch jobs and professionals with different skills will work on the data, you may choose an Azure Data Lake + Databricks architecture. Otherwise, Azure Data Lake alone would satisfy your needs.

These two articles may help: https://databricks.com/glossary/data-lake and https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/

Wei-Hsuan Chou answered Oct 01 '22 21:10