I'm new to data governance, so forgive me if the question lacks some information.
We're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS Gen2, Databricks, and Synapse for our ETL processing, data science, ML & QA activities.
We already have about a hundred input tables and 25 TB per year. We're expecting more in the future.
The business has a strong requirement inclining towards cloud-agnostic solutions. Still, they are okay with Databricks, since it's available on both AWS and Azure.
What is the best Data Governance solution for our stack and requirements?
I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS Gen2.
After some very quick googling I found three options:
1. Privacera
2. Immuta
3. Apache Ranger + Apache Atlas
Currently I'm not even sure whether the 3rd option fully supports our Azure stack. Moreover, it would require a much bigger development (infrastructure definition) effort. So, are there any reasons I should look in the Ranger/Atlas direction?
What are the reasons to prefer Privacera over Immuta and vice versa?
Are there any other options I should evaluate?
From a Data Governance perspective, we have done only the following things:
You can access Azure Synapse from Databricks using the Azure Synapse connector, a data source implementation for Apache Spark that uses Azure Blob storage, and PolyBase or the COPY statement in Azure Synapse to transfer large volumes of data efficiently between a Databricks cluster and an Azure Synapse instance.
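As a rough illustration, here's what a round trip through that connector looks like from a Databricks notebook (PySpark). The server, database, storage account, container, and table names below are placeholders, and authentication details are omitted:

    # Read a table from a Synapse dedicated SQL pool into a Spark DataFrame.
    # All connection values are placeholders.
    df = (spark.read
          .format("com.databricks.spark.sqldw")
          .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
          .option("forwardSparkAzureStorageCredentials", "true")
          .option("tempDir", "abfss://<container>@<storage-account>.dfs.core.windows.net/tmp")
          .option("dbTable", "dbo.input_table")
          .load())

    # Write transformed data back; the connector stages it in tempDir and
    # loads it into Synapse via PolyBase/COPY.
    (df.write
       .format("com.databricks.spark.sqldw")
       .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
       .option("forwardSparkAzureStorageCredentials", "true")
       .option("tempDir", "abfss://<container>@<storage-account>.dfs.core.windows.net/tmp")
       .option("dbTable", "dbo.output_table")
       .mode("overwrite")
       .save())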
Azure Synapse is more suited to data analysis and to users who are familiar with SQL. Databricks is better suited to streaming, ML, AI, and data science workloads, courtesy of its Spark engine, which enables the use of multiple languages. It isn't really a data warehouse at all.
Great SQL performance requires an MPP (massively parallel processing) architecture, and Databricks and Apache Spark are not MPP. The classic tradeoff between throughput and latency implies that a system can be great for either large queries (throughput-focused) or small queries (latency-focused), but not both.
The Modern Data Warehousing with Azure Databricks course is designed to teach the fundamentals of creating clusters, developing in notebooks, and leveraging the different languages available.
I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta has given me the better impression with its elegant policy-based setup.
Still, there are ways to solve some of the issues you mentioned above without buying an external component:
1. Security
For RLS, consider using Table ACLs and giving access only to certain Hive views (a minimal sketch is shown at the end of this section).
For getting access to data inside ADLS, look at enabling Azure AD credential passthrough on clusters. Unfortunately, that disables Scala.
You still need to set up permissions on Azure Data Lake Gen2, which is an awful experience when granting permissions on existing child items.
Please avoid creating dataset copies with column/row subsets, as data duplication is never a good idea.
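For example, here's a minimal sketch of the Hive-view approach on a cluster with Table Access Control enabled; the database, table, view, and group names are hypothetical:

    # Expose only a row subset of the underlying table through a view.
    spark.sql("""
        CREATE OR REPLACE VIEW marts.cdr_germany AS
        SELECT * FROM marts.cdr
        WHERE country_code = 'DE'
    """)

    # Grant access to the view only, not to the underlying table.
    spark.sql("GRANT SELECT ON VIEW marts.cdr_germany TO `de-analysts`")
    spark.sql("DENY SELECT ON TABLE marts.cdr TO `de-analysts`")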
2. Lineage
3. Data quality
4. Data life cycle management
One option is to use the data lake storage's native lifecycle management. However, that isn't really viable with Delta/Parquet formats, since the files are managed by the table rather than by the storage layer.
If you use the Delta format, it's easier to apply retention policies or to pseudonymize data.
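A minimal sketch of what that can look like with the Delta Lake Python API; the table path, column names, and retention period are made up for illustration:

    from delta.tables import DeltaTable

    events = DeltaTable.forPath(
        spark, "abfss://lake@<storage-account>.dfs.core.windows.net/silver/events")

    # Retention: delete rows older than the dataset's retention period.
    events.delete("event_date < date_sub(current_date(), 365)")

    # Pseudonymization: replace a sensitive column with a salted hash.
    events.update(
        condition="msisdn IS NOT NULL",
        set={"msisdn": "sha2(concat(msisdn, 'some-salt'), 256)"})

    # Physically remove old file versions after the Delta retention window.
    spark.sql("""
        VACUUM delta.`abfss://lake@<storage-account>.dfs.core.windows.net/silver/events`
        RETAIN 168 HOURS
    """)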
As a second option, imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users then use a small wrapper to read/write:
DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write("destination_dataset_friendly_name")
It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns or act based on retention time (both available in the dataset info table). It requires quite some effort; a rough sketch of such a wrapper is shown below.
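A minimal sketch of such a wrapper, written as a Python class rather than the static calls above; the governance.dataset_info table name and its columns (path, format, sensitive_columns) are assumptions for illustration:

    class DataWrapper:
        """Reads/writes datasets by friendly name using a dataset info table."""

        def __init__(self, spark, info_table="governance.dataset_info"):
            self.spark = spark
            self.info_table = info_table

        def _lookup(self, friendly_name):
            rows = (self.spark.table(self.info_table)
                    .filter(f"dataset_friendly_name = '{friendly_name}'")
                    .collect())
            if not rows:
                raise ValueError(f"Unknown dataset: {friendly_name}")
            return rows[0]

        def read(self, friendly_name, include_sensitive=False):
            info = self._lookup(friendly_name)
            df = self.spark.read.format(info["format"]).load(info["path"])
            if not include_sensitive and info["sensitive_columns"]:
                # Drop sensitive columns by default; audit logging could go here too.
                df = df.drop(*info["sensitive_columns"].split(","))
            return df

        def write(self, df, friendly_name, mode="append"):
            info = self._lookup(friendly_name)
            # Central place for logging, retention checks, schema validation, etc.
            df.write.format(info["format"]).mode(mode).save(info["path"])

    # Usage:
    # calls = DataWrapper(spark).read("customer_calls")
    # DataWrapper(spark).write(calls_daily, "customer_calls_daily")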
Hopefully you find something useful in my answer. It would be interesting to know which path you took.
To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating the experience of applying RLS on Databricks; a related Databricks video demo; and other data governance tutorials.
Full disclosure: My team produces content for data engineers at Immuta and I hope this helps save you some time in your research.