Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. We can then easily use Athena, Redshift, or EMR to query the data on S3, with Glue as the metastore.
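For instance, a table already registered in Glue can be queried straight from Athena; the database, table, and result-bucket names below are placeholders:

```python
import boto3

# Hypothetical database/table/bucket names, purely for illustration.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},               # Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status/results
```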
My question is: is it possible to expose the Glue Data Catalog as the metastore for external services such as Databricks hosted on AWS?
The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore.
The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given data set, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.
AWS EMR and Databricks both provide cloud-based big data platforms for data processing, interactive analysis, and building machine learning applications. Compared to traditional on-premises solutions, EMR not only runs petabyte-scale analysis at lower cost, but its optimized runtime is also faster than standard Apache Spark.
The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Information in the Data Catalog is stored as metadata tables, where each table specifies a single data store.
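As a hedged illustration (the database and table names are made up), the location and schema that the Data Catalog holds for a table can be read directly via boto3:

```python
import boto3

# Hypothetical names, for illustration only.
glue = boto3.client("glue", region_name="us-east-1")

table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]
print(table["StorageDescriptor"]["Location"])        # physical S3 location
for col in table["StorageDescriptor"]["Columns"]:    # schema
    print(col["Name"], col["Type"])
```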
Databricks now provides documentation for using the Glue Data Catalog as the metastore. The required configuration steps are described here:
Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html
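In short (per that page), you attach an instance profile with Glue permissions to the cluster and set one Spark configuration flag; after that, Glue databases and tables appear in the normal Spark catalog. A minimal sketch from a Databricks notebook, with placeholder database/table names:

```python
# Cluster Spark config (set in the cluster configuration, per the Databricks docs):
#   spark.databricks.hive.metastore.glueCatalog.enabled true
#
# With that enabled and an instance profile that can call Glue,
# the Glue Data Catalog is browsable like any Hive metastore.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM sales_db.orders LIMIT 10").show()  # placeholder names
```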
There have also been a couple of decent documentation/write-up pieces from Databricks (see the docs and the blog post), though those cover custom/legacy Hive metastore integration rather than Glue itself.
Also, as a Plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do a one-way replication to Glue through the Java SDK (or perhaps the other way around as well, mapping AWS API responses to sequences of create table / create partition statements). Of course, this is riddled with rather complex corner cases, such as cascading partition/table deletions, but for simple create-only scenarios it seems approachable at least.
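A very rough create-only sketch of the Databricks-to-Glue direction could look like the following. It uses boto3 rather than the Java SDK mentioned above, assumes a Databricks notebook (where `spark` is available) with credentials that can call Glue, hard-codes a Parquet SerDe, and ignores partitions, deletes, and error handling; the database name is a placeholder.

```python
import boto3

# Create-only sync sketch: copy table definitions from the Databricks
# metastore into the Glue Data Catalog. Hypothetical database name.
glue = boto3.client("glue", region_name="us-east-1")
target_db = "sales_db"  # must already exist on both sides

for t in spark.catalog.listTables(target_db):
    # Column names/types from the Databricks metastore (skip partition columns).
    cols = [
        {"Name": c.name, "Type": c.dataType}
        for c in spark.catalog.listColumns(t.name, target_db)
        if not c.isPartition
    ]
    # Physical location of the table's data.
    location = (
        spark.sql(f"DESCRIBE FORMATTED {target_db}.{t.name}")
        .filter("col_name = 'Location'")
        .collect()[0]["data_type"]
    )
    # Register the table in Glue; fails if it already exists there.
    glue.create_table(
        DatabaseName=target_db,
        TableInput={
            "Name": t.name,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": cols,
                "Location": location,
                # Parquet format assumed; a real sync would copy the actual SerDe too.
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )
```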