Data Governance solution for Databricks, Synapse and ADLS gen2

I'm new to data governance, so forgive me if the question lacks some information.

Objective

We're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS Gen2, Databricks and Synapse for our ETL processing, data science, ML & QA activities.

We already have about a hundred input tables and 25 TB of data per year. We're expecting more in the future.

The business has a strong requirement leaning towards cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.

Question

What is the best Data Governance solution for our stack and requirements?

My workarounds

I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out-of-the-box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS Gen2.

After very quick googling I found three options:

  1. Databricks + Privacera
  2. Databricks + Immuta
  3. Apache Ranger & Apache Atlas

Currently I'm not even sure the 3rd option has full support for our Azure stack. Moreover, it will require a much bigger development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?

What are the reasons to prefer Privacera over Immuta and vice versa?

Are there any other options I should evaluate?

What is already done

From a data governance perspective we have done only the following:

  1. Defined data zones inside ADLS
  2. Applied encryption/obfuscation to sensitive data (due to GDPR requirements); a rough sketch of this step follows the list.
  3. Implemented Row-Level Security (RLS) at the Synapse and Power BI layers
  4. Built a custom audit framework for logging what was persisted and when
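
For context, the obfuscation step boils down to a transformation roughly like the PySpark sketch below (column names, paths and the secret-scope name are made-up examples rather than our real setup; it assumes a Databricks notebook where spark and dbutils are predefined):

    # Rough sketch: pseudonymize sensitive columns with a salted SHA-256 hash.
    # Column names (msisdn, email), paths and the secret scope are hypothetical.
    from pyspark.sql import functions as F

    salt = dbutils.secrets.get("governance", "hash-salt")

    df = spark.read.format("delta").load("/mnt/datalake/raw/customers")

    pseudonymized = (
        df.withColumn("msisdn", F.sha2(F.concat(F.col("msisdn"), F.lit(salt)), 256))
          .withColumn("email", F.sha2(F.concat(F.col("email"), F.lit(salt)), 256))
    )

    pseudonymized.write.format("delta").mode("overwrite").save("/mnt/datalake/standardized/customers")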

Things to be done

  1. Data lineage and a single source of truth. Even 4 months from the start, it has become a pain point to understand dependencies between data sets. The lineage information is stored in Confluence, which is hard to maintain and continuously update in multiple places. Even now it's outdated in some places.
  2. Security. Business users may do some data exploration in Databricks notebooks in the future. We need RLS for Databricks.
  3. Data Life Cycle management.
  4. Maybe other data governance related stuff, such as data quality, etc.

2 Answers

I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta has given me the better impression with its elegant policy-based setup.

Still, there are ways to solve some of the issues you mentioned above without buying an external component:

1. Security

  • For RLS, consider using Table ACLs and granting access only to certain Hive views (a minimal sketch follows this list).

  • For getting access to data inside ADLS, look at enabling Azure AD credential passthrough on clusters. Unfortunately, that disables Scala.

  • You still need to set up permissions on Azure Data Lake Storage Gen2, which is an awful experience when it comes to granting permissions on existing child items.

  • Please avoid creating dataset copies with column/row subsets, as data duplication is never a good idea.
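
To illustrate the first bullet, the usual pattern is to expose a filtered view and grant access only to that view. Below is a minimal sketch, assuming a cluster with table access control enabled, a Databricks notebook where spark is predefined, and made-up table/view/group names (exact GRANT syntax can vary with your workspace setup):

    # Rough sketch of RLS via a Hive view + Table ACLs; all object names are hypothetical.
    # Expose only the rows each business user is allowed to see.
    spark.sql("""
        CREATE OR REPLACE VIEW sales.orders_rls AS
        SELECT o.*
        FROM sales.orders o
        JOIN security.user_region_map m
          ON o.region = m.region
        WHERE m.user_name = current_user()
    """)

    # Grant SELECT on the view only; keep the underlying table locked down.
    spark.sql("GRANT SELECT ON VIEW sales.orders_rls TO `business-users`")
    spark.sql("DENY SELECT ON TABLE sales.orders TO `business-users`")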

2. Lineage

  • One option would be to look into Apache Atlas & Spline. Here is one example of how to set this up: https://medium.com/@reenugrewal/data-lineage-tracking-using-spline-on-atlas-via-event-hub-6816be0fd5c7
  • Unfortunately, Spline is still under development; even reproducing the setup mentioned in the article is not straightforward. The good news is that Apache Atlas 3.0 has many type definitions available for Azure Data Lake Gen2 and other sources.
  • In a few projects, I ended up creating custom logging of reads/writes (it seems like you went down this path as well). Based on these logs, I created a Power BI report to visualize the lineage (a small sketch follows this list).
  • Consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage that helps you see dependencies and rerun failed activities. You can read a bit more here: https://mrpaulandrew.com/2020/07/01/adf-procfwk-v1-8-complete-pipeline-dependency-chains-for-failure-handling/
  • Take a look at Marquez (https://marquezproject.github.io/marquez/), a small open-source project with some nice features, including data lineage.
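
To expand on the custom-logging idea from the third bullet, a thin wrapper around reads/writes that appends events to a lineage table is usually enough to feed a Power BI report. A minimal sketch, assuming a Databricks notebook (spark predefined) and hypothetical paths/column names:

    # Rough sketch: log read/write events to a Delta table and build lineage from it.
    from datetime import datetime
    from pyspark.sql import Row

    LINEAGE_PATH = "/mnt/datalake/governance/lineage_events"  # hypothetical location

    def _log_event(job_name, operation, path):
        event = Row(job_name=job_name,
                    operation=operation,          # "read" or "write"
                    path=path,
                    logged_at=datetime.utcnow().isoformat())
        spark.createDataFrame([event]).write.format("delta").mode("append").save(LINEAGE_PATH)

    def read_with_lineage(job_name, path):
        _log_event(job_name, "read", path)
        return spark.read.format("delta").load(path)

    def write_with_lineage(df, job_name, path):
        _log_event(job_name, "write", path)
        df.write.format("delta").mode("overwrite").save(path)

Joining the read and write events per job_name then gives you the source-to-target edges to visualize in Power BI.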

3. Data quality

  • Investigate Amazon Deequ - Scala only so far, but it has some nice predefined data quality functions.
  • In many projects, we ended up writing integration tests that check data quality when moving from bronze (raw) to silver (standardized). Nothing fancy, just pure PySpark (a small sketch follows).
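
To give an idea of the second bullet, the checks are often just assertions on counts, nulls and duplicates when promoting bronze to silver. A small sketch with made-up paths and column names, assuming the Databricks-provided spark session:

    # Rough sketch of bronze -> silver quality checks in pure PySpark; paths/columns are hypothetical.
    from pyspark.sql import functions as F

    bronze = spark.read.format("delta").load("/mnt/datalake/bronze/cdr")
    silver = spark.read.format("delta").load("/mnt/datalake/silver/cdr")

    # 1. No records lost during standardization.
    assert silver.count() >= bronze.dropDuplicates(["call_id"]).count(), "records lost in silver"

    # 2. Business key is non-null and unique.
    assert silver.filter(F.col("call_id").isNull()).count() == 0, "null business keys"
    assert silver.groupBy("call_id").count().filter("count > 1").count() == 0, "duplicate business keys"

    # 3. Mandatory columns are present.
    for column in ["call_id", "msisdn", "call_start_ts"]:
        assert column in silver.columns, f"missing column: {column}"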

4. Data life cycle management

  • One option is to use the data lake storage's native lifecycle management. That's not a viable option for data stored behind Delta/Parquet formats, though.

  • If you use the Delta format, you can apply retention policies or pseudonymize data more easily.

  • Second option: imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users then use a small wrapper to read/write:

    DataWrapper.Read("dataset_friendly_name")

    DataWrapper.Write("destination_dataset_friendly_name")

It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns or act based on retention time (both available in the dataset info table). It requires quite some effort; a rough sketch of such a wrapper follows this list.

  • You can always expand this table with a more advanced schema, adding extra information about pipelines, dependencies, etc. (see 2.4).
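
A rough sketch of what such a wrapper could look like (the dataset-info table layout and all names are hypothetical, and spark is the Databricks-provided session):

    # Rough sketch of a metadata-driven DataWrapper; table layout and names are hypothetical.
    from pyspark.sql import functions as F

    DATASET_INFO_PATH = "/mnt/datalake/governance/dataset_info"

    class DataWrapper:

        @staticmethod
        def _info(friendly_name):
            row = (spark.read.format("delta").load(DATASET_INFO_PATH)
                        .filter(F.col("dataset_friendly_name") == friendly_name)
                        .first())
            if row is None:
                raise ValueError(f"Unknown dataset: {friendly_name}")
            return row

        @staticmethod
        def Read(friendly_name, drop_sensitive=True):
            info = DataWrapper._info(friendly_name)
            df = spark.read.format("delta").load(info["path"])
            if drop_sensitive and info["sensitive_columns"]:
                df = df.drop(*info["sensitive_columns"].split(","))
            return df

        @staticmethod
        def Write(df, friendly_name):
            info = DataWrapper._info(friendly_name)
            # Audit logging and retention handling would be hooked in here.
            df.write.format("delta").mode("overwrite").save(info["path"])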

Hopefully you find something useful in my answer. It would be interesting to know which path you took.



To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating how to apply RLS on Databricks, a related Databricks video demo, and other data governance tutorials.

Full disclosure: My team produces content for data engineers at Immuta and I hope this helps save you some time in your research.
