 

AWS Glue Data Catalog as Metastore for external services like Databricks

Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. We can then easily use Athena, Redshift, or EMR to query data on S3 with Glue as the metastore.
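For example, once tables are registered in Glue, querying them through Athena takes only a few lines of boto3 (a minimal sketch; the database, table, and results bucket below are made up):

```python
import boto3

athena = boto3.client("athena")

# Run a query against a Glue-catalogued table; Athena resolves the
# schema and S3 location from the Glue Data Catalog.
resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM events",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```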

My question is: is it possible to expose the Glue Data Catalog as a metastore for external services such as Databricks hosted on AWS?

Obaid asked Apr 16 '18


People also ask

Is AWS Glue a Metastore?

The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore.
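For instance, on an EMR cluster, Spark's Hive support can be pointed at Glue through the Glue client factory that EMR ships (a sketch; it assumes the factory class is on the classpath, as it is on EMR):

```python
from pyspark.sql import SparkSession

# Use the Glue Data Catalog as the Hive metastore from Spark on EMR.
spark = (
    SparkSession.builder
    .appName("glue-as-hive-metastore")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # databases come from Glue, not a local metastore
```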

What does the AWS Glue metadata catalog service do?

The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given data set, you can store its table definition, physical location, add business relevant attributes, as well as track how this data has changed over time.
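As a sketch, all of that metadata is visible through the Glue API (the database and table names here are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Fetch the stored definition of one table.
table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]

print(table["StorageDescriptor"]["Columns"])   # table definition (schema)
print(table["StorageDescriptor"]["Location"])  # physical location on S3
print(table.get("Parameters", {}))             # business-relevant attributes
print(table.get("VersionId"))                  # Glue versions tables over time
```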

What is the AWS equivalent of Databricks?

AWS EMR and Databricks both provide a cloud-based big data platform for data processing, interactive analysis, and building machine learning applications. Compared to traditional on-premises solutions, EMR not only runs petabyte-scale analysis at lower cost but is also faster than standard Apache Spark.

Does AWS have a data catalog?

The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Information in the Data Catalog is stored as metadata tables, where each table specifies a single data store.
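A short boto3 sketch of walking that index (paginators are used since catalogs can be large):

```python
import boto3

glue = boto3.client("glue")

# Each Glue table is a metadata entry pointing at a single data store.
for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for t_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            for table in t_page["TableList"]:
                location = table.get("StorageDescriptor", {}).get("Location")
                print(db["Name"], table["Name"], location)
```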


2 Answers

Databricks now provides documentation on using the Glue Data Catalog as the metastore. It can be set up with the following steps (minimal sketches of steps 1-2 and step 6 follow the list):

  1. Create an IAM role and policy to access a Glue Data Catalog
  2. Create a policy for the target Glue Catalog
  3. Look up the IAM role used to create the Databricks deployment
  4. Add the Glue Catalog IAM role to the EC2 policy
  5. Add the Glue Catalog IAM role to a Databricks workspace
  6. Launch a cluster with the Glue Catalog IAM role
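For steps 1-2, a minimal boto3 sketch of the Glue access policy; the policy name, action list, and resource ARN are illustrative, so scope them to your account and region in practice:

```python
import json
import boto3

iam = boto3.client("iam")

# A minimal read/write policy on the Glue Data Catalog.
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase*", "glue:GetTable*", "glue:GetPartition*",
            "glue:CreateDatabase", "glue:CreateTable", "glue:BatchCreatePartition",
        ],
        "Resource": "arn:aws:glue:us-east-1:<account-id>:*",
    }],
}

iam.create_policy(
    PolicyName="databricks-glue-catalog-access",
    PolicyDocument=json.dumps(glue_policy),
)
```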

Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html.
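And for step 6, a hedged sketch of launching the cluster through the Databricks Clusters REST API; the workspace URL, token, runtime version, and instance profile ARN are placeholders, and the instance profile must already be registered with the workspace (step 5):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "glue-catalog-cluster",
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        # Step 6: attach the Glue Catalog IAM role as an instance profile.
        "aws_attributes": {
            "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/glue-catalog-role"
        },
        # The switch that makes Glue the cluster's metastore.
        "spark_conf": {
            "spark.databricks.hive.metastore.glueCatalog.enabled": "true"
        },
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```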

matiasm answered Oct 14 '22


There have been a couple of decent documentation/write-up pieces from Databricks (see the docs and the blog post), though they cover custom/legacy Hive metastore integration, not Glue itself.

Also, as a plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do one-way replication to Glue through the Java SDK (or perhaps the other way around, mapping AWS API responses to sequences of create table / create partition statements). Of course, this is riddled with rather complex corner cases, like cascading partition/table deletions, but for simple create-only workloads it at least seems approachable.
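A create-only sketch of that replication in Python with boto3 (the answer mentions the Java SDK; boto3 is used here just for brevity, and the table definition below is a made-up example of what you might extract from the Databricks metastore):

```python
import boto3

glue = boto3.client("glue")

# A table definition as it might be pulled out of the Databricks/Hive
# metastore (e.g. via DESCRIBE FORMATTED); all values are illustrative.
table_def = {
    "Name": "events",
    "TableType": "EXTERNAL_TABLE",
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "user_id", "Type": "bigint"},
            {"Name": "payload", "Type": "string"},
        ],
        "Location": "s3://my-datalake/events/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}

# Create-only: skip tables that already exist rather than trying to
# handle updates or cascading deletions.
try:
    glue.create_table(DatabaseName="analytics", TableInput=table_def)
except glue.exceptions.AlreadyExistsException:
    pass
```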

Anton Kraievyi answered Oct 14 '22