Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether this can also be done outside the Databricks platform. Has anyone done that? Also, is it possible to add Delta Lake related metadata using Glue crawlers?

asked Oct 02 '19 06:10

gorros

2 Answers

This is not possible. You can crawl the Delta files on S3 outside the Databricks platform, but the resulting tables will not return the correct data.

The documentation includes the following warning:

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
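To make the warning concrete, here is a minimal sketch of why "all files on S3" and "files in the current table version" differ. It assumes the commit files under _delta_log are newline-delimited JSON with "add" and "remove" actions; the file names and log contents are made up for illustration, real actions carry more fields, and checkpoints are ignored.

```python
import json

def live_files(commits):
    """Replay Delta commits (oldest first) and return the data files
    that belong to the latest table version."""
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# Hypothetical log: version 0 writes one file, version 1 rewrites it.
commit_0 = '{"add": {"path": "part-0000-a.parquet"}}'
commit_1 = (
    '{"remove": {"path": "part-0000-a.parquet"}}\n'
    '{"add": {"path": "part-0000-b.parquet"}}'
)

# A crawler pointed at the table location sees every Parquet file still on S3.
all_files_on_s3 = {"part-0000-a.parquet", "part-0000-b.parquet"}

current = live_files([commit_0, commit_1])
stale = all_files_on_s3 - current

print(sorted(current))  # ['part-0000-b.parquet']
print(sorted(stale))    # ['part-0000-a.parquet']
```

A table defined over every Parquet file would also read the stale file, so rewritten rows would be counted twice.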

answered Sep 20 '22 04:09

user3199285

I currently generate manifests of Delta tables with Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).

I generate a manifest file for each Delta table using:

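from delta.tables import DeltaTable
# `spark` is an active SparkSession configured with Delta Lake
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")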

Then I create the table using the DDL below. It also registers the table in the Glue Data Catalog, so you can then access the data from AWS Glue through the catalog.

CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of the generated manifest
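If you have many Delta tables, filling in the DDL template above by hand gets tedious. The following is a rough sketch of a helper that renders the statement from a column list. The function name, the table name, the columns, and the bucket path are all hypothetical; it does no identifier escaping or validation, and submitting the statement to your query engine is left out.

```python
def symlink_table_ddl(table, columns, delta_path, partitions=()):
    """Render a CREATE EXTERNAL TABLE statement for a manifest-backed table.

    columns and partitions are sequences of (name, hive_type) pairs.
    """
    def render(cols):
        return ", ".join(f"{name} {dtype}" for name, dtype in cols)

    lines = [f"CREATE EXTERNAL TABLE {table} ({render(columns)})"]
    if partitions:
        lines.append(f"PARTITIONED BY ({render(partitions)})")
    lines += [
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'",
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'",
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'",
        # The table location is the manifest directory, not the Delta table root.
        f"LOCATION '{delta_path.rstrip('/')}/_symlink_format_manifest/'",
    ]
    return "\n".join(lines)

# Hypothetical table, columns, and S3 path, for illustration only.
ddl = symlink_table_ddl(
    "events",
    [("id", "bigint"), ("payload", "string")],
    "s3://my-bucket/delta/events/",
    partitions=[("dt", "string")],
)
print(ddl)
```

Note that the manifest is a snapshot, so it has to be regenerated after the Delta table changes for readers to see the new data.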

answered Sep 17 '22 04:09

dp6000

Tags: amazon-s3, apache-spark, aws-glue, delta-lake
