Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing consists of running jobs on these files to extract various information, e.g. data for certain date ranges, events related to a scenario, or joining data from multiple tables/files. These jobs run every day as U-SQL jobs in Data Factory (v1 or v2), and the results are then sent to Power BI for visualization.
Using ADLA for all this processing feels slow and expensive. I got a suggestion that I should use Azure Databricks for the above processes instead. Could somebody help me understand the difference between the two and whether it would be worthwhile to switch? Can I convert all my U-SQL jobs into the Databricks notebook format?
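For context, a typical daily job just filters CSV rows to a date window. Here is a minimal plain-Python stand-in for that U-SQL logic (the file contents, column names like `EventDate`, and the date range are all made up for illustration; in a Databricks notebook this would become a DataFrame filter):

```python
import csv
import io
from datetime import date

def filter_by_date(rows, start, end, date_col="EventDate"):
    """Keep only rows whose date_col falls within [start, end]."""
    for row in rows:
        if start <= date.fromisoformat(row[date_col]) <= end:
            yield row

# Sample data standing in for one of the CSV files in the lake.
sample = io.StringIO(
    "EventDate,Event\n"
    "2019-01-05,login\n"
    "2019-02-10,purchase\n"
    "2019-03-01,logout\n"
)
reader = csv.DictReader(sample)
selected = list(filter_by_date(reader, date(2019, 1, 1), date(2019, 2, 28)))
print([r["Event"] for r in selected])  # → ['login', 'purchase']
```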
In a simple comparison, we found that Data Lake Analytics is efficient at transformation and load operations thanks to its distributed batch runtime. Databricks, on the other hand, gives you rich visibility into each step of a transformation, which makes it easier to validate intermediate results.
ADF is primarily used for Data Integration services to perform ETL processes and orchestrate data movements at scale. In contrast, Databricks provides a collaborative platform for Data Engineers and Data Scientists to perform ETL as well as build Machine Learning models under a single platform.
Azure Databricks provides the latest versions of Apache Spark and allows you to seamlessly integrate with open source libraries. Spin up clusters and build quickly in a fully managed Apache Spark environment with the global scale and availability of Azure.
Azure Data Factory is an orchestration tool for Data Integration services, used to carry out ETL workflows and orchestrate data movement at scale. Azure Databricks provides a single collaboration platform for Data Scientists and Engineers to execute ETL and create Machine Learning models with visualization dashboards.
Disclaimer: I work for Databricks.
It is tough to give pros/cons or advice without knowing how much data you work with, what kind of data it is, or how long your processing times are. An accurate cost comparison between Azure Data Lake Analytics and Databricks really requires speaking with a member of the sales team.
Keep in mind that ADLA is based on the YARN cluster manager (from Hadoop) and only runs U-SQL batch-processing workloads. A description from BlueGranite:
ADLA is focused on batch processing, which is great for many Big Data workloads.
Some example uses for ADLA include, but are not limited to:
- Prepping large amounts of data for insertion into a Data Warehouse
- Processing scraped web data for science and analysis
- Churning through text and quickly tokenizing it to enable context and sentiment analysis
- Using image processing intelligence to quickly process unstructured image data
- Replacing long-running monthly batch processing with shorter running distributed processes
Databricks covers both batch and stream processing, and handles both ETL (data engineering) and data science (machine learning, deep learning) workloads, so companies generally adopt it to consolidate those pipelines on a single platform.
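As a sketch of how the "adding data from multiple tables/files" pattern ports: the core of such a U-SQL job is a key join, shown below in plain Python so the logic is visible. In a Databricks notebook the same step would become a PySpark DataFrame join such as `orders.join(customers, "CustomerId")`. All file contents and column names here are illustrative, not from your actual data:

```python
import csv
import io

# Two small CSV "tables" standing in for files in the data lake.
customers_csv = io.StringIO("CustomerId,Name\n1,Ada\n2,Grace\n")
orders_csv = io.StringIO(
    "OrderId,CustomerId,Amount\n10,1,99.50\n11,2,15.00\n12,1,42.00\n"
)

# Build a lookup keyed on the join column.
customers = {row["CustomerId"]: row for row in csv.DictReader(customers_csv)}

# Inner join: attach the customer name to each order.
joined = [
    {**order, "Name": customers[order["CustomerId"]]["Name"]}
    for order in csv.DictReader(orders_csv)
    if order["CustomerId"] in customers
]
print([(r["OrderId"], r["Name"]) for r in joined])
# → [('10', 'Ada'), ('11', 'Grace'), ('12', 'Ada')]
```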
There are more reasons than those, but those are some of the most common. If you think it may help your situation, try the free trial on the website.