According to an article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether this can also be done outside the Databricks platform. Has anyone done that? Also, is it possible to add Delta Lake metadata using Glue crawlers?
In addition, you can set a crawler configuration option to Update all new and existing partitions with metadata from the table on the AWS Glue console.
You should use AWS Glue to discover properties of the data you own, transform it, and prepare it for analytics. Glue can automatically discover both structured and semi-structured data stored in your data lake on Amazon S3, data warehouse in Amazon Redshift, and various databases running on AWS.
Glue then writes metadata from the job into the embedded AWS Glue Data Catalog. The service can automatically find an enterprise's structured or unstructured data when it is stored within data lakes in S3, data warehouses in Amazon Redshift and other databases that are part of the Amazon Relational Database Service.
To create a crawler that reads files stored on Amazon S3: on the AWS Glue service console, on the left-side menu, choose Crawlers. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the crawler details. In the Crawler name field, enter Flights Data Crawler, and choose Next.
This is not possible with a crawler. You can crawl the Delta files on S3 outside the Databricks platform, but the resulting tables will not return the correct data.
The documentation includes the following warning:
Warning
Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
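To illustrate the warning, here is a simplified sketch of how a Delta reader decides which Parquet files belong to the current table version. It replays `add` and `remove` actions from the transaction log in commit order. The file names and the in-memory `log` are made up for illustration, and a real log contains more action types plus checkpoints, so treat this as a model of the idea rather than the actual reader.

```python
def live_files(log_entries):
    """Replay add/remove actions in commit order and return the files in the current snapshot."""
    live = set()
    for commit in log_entries:
        for action in commit:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical two-commit history: the second commit rewrites the first file.
log = [
    [{"add": {"path": "part-0000.parquet"}}],
    [
        {"remove": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0001.parquet"}},
    ],
]

# Both files still exist on S3 until they are vacuumed, so a crawler sees both.
files_on_s3 = {"part-0000.parquet", "part-0001.parquet"}

current = live_files(log)
stale = files_on_s3 - current

print("current snapshot:", sorted(current))
print("stale but still on S3:", sorted(stale))
```

A crawler that registers every Parquet file under the table path would include the stale file as well, which is why queries against such a table can return duplicated or outdated rows.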
I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).
I generate a manifest file for each Delta Table using:
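from delta.tables import DeltaTable
# `spark` is an existing SparkSession with Delta Lake configured
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")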
Then I created the table using the example below. The DDL also registers the table in the Glue Data Catalog, so you can then access the data from AWS Glue through the catalog.
CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/' -- location of the generated manifest
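If you have several Delta tables, the DDL template above can be filled in programmatically. The helper below is only a sketch: `manifest_table_ddl`, the table name, the columns, and the bucket path are all hypothetical, and the function only builds the statement as a string. You would still run it yourself, for example in Athena.

```python
def manifest_table_ddl(table, columns, delta_path, partitions=None):
    """Render the CREATE EXTERNAL TABLE statement for a Delta symlink manifest.

    columns and partitions are lists of (name, type) pairs.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    lines = [f"CREATE EXTERNAL TABLE {table} ({cols})"]
    if partitions:
        part_cols = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
        lines.append(f"PARTITIONED BY ({part_cols})")
    # Normalise the trailing slash so the manifest folder is appended exactly once.
    base = delta_path.rstrip("/")
    lines += [
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'",
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'",
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'",
        f"LOCATION '{base}/_symlink_format_manifest/'",
    ]
    return "\n".join(lines)


# Hypothetical table definition, for illustration only.
ddl = manifest_table_ddl(
    "flights",
    [("origin", "string"), ("delay", "int")],
    "s3://my-bucket/delta/flights/",
    partitions=[("year", "int")],
)
print(ddl)
```

For a partitioned table, the Delta Lake integration guide also has you run MSCK REPAIR TABLE after creating the table so that the partitions listed in the manifest are registered.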