Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate delta lake with AWS Glue. However, I am not sure if it is possible to do it also outside of Databricks platform. Has someone done that? Also, is it possible to add Delta Lake related metadata using Glue crawlers?

like image 436
gorros Avatar asked Oct 02 '19 06:10

gorros


People also ask

Can glue crawler update existing table?

In addition, you can set a crawler configuration option to Update all new and existing partitions with metadata from the table on the AWS Glue console.

Which type of data can be discover by AWS Glue?

You should use AWS Glue to discover properties of the data you own, transform it, and prepare it for analytics. Glue can automatically discover both structured and semi-structured data stored in your data lake on Amazon S3, data warehouse in Amazon Redshift, and various databases running on AWS.

Does AWS Glue support unstructured data?

Glue then writes metadata from the job into the embedded AWS Glue Data Catalog. The service can automatically find an enterprise's structured or unstructured data when it is stored within data lakes in S3, data warehouses in Amazon Redshift and other databases that are part of the Amazon Relational Database Service.

How do you make a crawler from AWS Glue?

To create a crawler that reads files stored on Amazon S3 On the AWS Glue service console, on the left-side menu, choose Crawlers. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the crawler details. In the Crawler name field, enter Flights Data Crawler , and choose Next.


2 Answers

This is not possible. Although you can crawl the S3 delta files outside the databrics platform but you won't find the data in the tables.

As per the doc, it says below :

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.

like image 170
user3199285 Avatar answered Sep 20 '22 04:09

user3199285


I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).

I generate a manifest file for each Delta Table using:

deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Then created the table using the example below. The DDL below also creates the table inside Glue Catalog; you can then access the data from AWS Glue using Glue Data Catalog.

CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of 
the generated manifest
like image 24
dp6000 Avatar answered Sep 17 '22 04:09

dp6000