AWS Glue Crawler Overwrite Data vs. Append

Question

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.

CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).

I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.

Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?

Thanks very much in advance!

Alex Skorokhod · Accepted Answer

It is not possible the way you are asking. The Crawler does not alter data.

The Crawler only populates the AWS Glue Data Catalog with table definitions; it does not modify or delete the underlying files in S3. Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

If you want to clean the data with Athena/Glue before using it, follow these steps:

1. Map the data into a temporary Athena database/table using a Crawler.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to
   - transform, clean, rename, and dedupe the data using PySpark or Scala
   - export the data to a new S3 location (.csv, .parquet, etc.), potentially partitioning it
4. Run one more Crawler to map the cleaned data from the new S3 location into an Athena database.
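
Since each daily file in the question is a full snapshot, one option inside the Glue job step above is to read only the newest file rather than every file in the folder. The following is a minimal plain-Python sketch of that selection, not the Glue job itself. It assumes the object listing has the shape of the "Contents" entries returned by boto3's S3 `list_objects_v2` call; the function name `newest_csv_key` and the sample keys are hypothetical.

```python
from datetime import datetime, timezone


def newest_csv_key(contents):
    """Return the key of the most recently modified .csv object, or None.

    contents: list of dicts shaped like the "Contents" entries of an S3
        listing (each has at least "Key" and "LastModified").
    """
    # Ignore marker files and anything else that is not a CSV export.
    csv_objects = [obj for obj in contents if obj["Key"].lower().endswith(".csv")]
    if not csv_objects:
        return None
    return max(csv_objects, key=lambda obj: obj["LastModified"])["Key"]


# Hypothetical listing; in a real job this would come from the S3 API.
listing = [
    {"Key": "vendor/export_2020-01-01.csv",
     "LastModified": datetime(2020, 1, 1, 6, tzinfo=timezone.utc)},
    {"Key": "vendor/export_2020-01-02.csv",
     "LastModified": datetime(2020, 1, 2, 6, tzinfo=timezone.utc)},
    {"Key": "vendor/_SUCCESS",
     "LastModified": datetime(2020, 1, 3, 6, tzinfo=timezone.utc)},
]

print(newest_csv_key(listing))
```

The job could then load only that key and write the result to the new S3 location, so the cleaned table always reflects the latest upload.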

The dedupe you are asking about happens in step 3.
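
To make the dedupe logic concrete, here is a small sketch in plain Python using only the standard library. It is an illustration of the idea (keep the newest version of each record across daily snapshots), not an actual Glue job, which per the answer would be written in PySpark or Scala. The column name `order_id`, the function name `dedupe_keep_latest`, and the sample data are all hypothetical.

```python
import csv
import io


def dedupe_keep_latest(snapshots, key_field):
    """Merge daily CSV snapshots, keeping the newest version of each row.

    snapshots: iterable of (snapshot_date, csv_text) pairs, where
        snapshot_date is an ISO date string such as "2020-01-02".
    key_field: name of the column that identifies a record (assumed to exist).
    """
    latest = {}
    # Process the oldest snapshot first so newer rows overwrite older ones.
    for snapshot_date, csv_text in sorted(snapshots, key=lambda s: s[0]):
        for row in csv.DictReader(io.StringIO(csv_text)):
            latest[row[key_field]] = row
    return list(latest.values())


# Hypothetical data: the second file restates order 1 with a changed amount.
day1 = "order_id,amount\n1,10\n2,20\n"
day2 = "order_id,amount\n1,15\n2,20\n3,30\n"

cleaned = dedupe_keep_latest(
    [("2020-01-02", day2), ("2020-01-01", day1)], "order_id"
)
for row in cleaned:
    print(row["order_id"], row["amount"])
```

The snapshot date acts as the version marker here, so a changed historical row in a later file replaces the earlier one instead of appearing twice.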


Tags:

amazon-web-services

amazon-s3

aws-glue

Christopher Taylor
