
AWS Glue Crawler Overwrite Data vs. Append

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.

CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).

I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.

Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?

Thanks very much in advance!

Christopher Taylor asked Mar 28 '26 23:03


1 Answer

It is not possible the way you are asking. The Crawler does not alter, overwrite, or append data.

The Crawler only populates the AWS Glue Data Catalog with table definitions; the data itself stays in S3 as uploaded. Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

If you want to clean the data with Athena/Glue before using it, follow these steps:

  1. Use a Crawler to map the raw data into a temporary Athena database/table.

  2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to change.

  3. Use a Glue job to

    • transform, clean, rename, and dedupe the data using PySpark or Scala
    • export the data to a new S3 location (.csv, .parquet, etc.), optionally partitioning it

  4. Run one more Crawler to map the cleaned data from the new S3 location into an Athena database.

The dedupe you are asking about happens in step 3.
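To make the dedupe in step 3 concrete, here is a minimal sketch of the logic in plain Python rather than PySpark, so it runs without a Glue environment. The file names, the `order_id` key column, and the sample rows are all hypothetical, and the sketch assumes file names sort chronologically. It keeps one row per key, taken from the most recent file that contains that key.

```python
import csv
import io


def dedupe_latest(files, key_fields):
    """Keep one row per key, taken from the most recent file containing it.

    files: iterable of (file_name, csv_text) pairs. File names are assumed
    to sort chronologically (hypothetical naming, e.g. export_2024-01-02.csv).
    key_fields: column names that together identify a record (assumption).
    """
    latest = {}
    for _name, text in sorted(files, key=lambda f: f[0]):
        for row in csv.DictReader(io.StringIO(text)):
            key = tuple(row[field] for field in key_fields)
            latest[key] = row  # rows from later files replace earlier ones
    return list(latest.values())


# Hypothetical daily exports: the second file revises order 1 and adds order 3.
day1 = "order_id,order_date,amount\n1,2016-01-01,10\n2,2016-01-02,20\n"
day2 = (
    "order_id,order_date,amount\n"
    "1,2016-01-01,15\n2,2016-01-02,20\n3,2016-01-03,30\n"
)

rows = dedupe_latest(
    [("export_2024-01-02.csv", day2), ("export_2024-01-01.csv", day1)],
    key_fields=["order_id"],
)
for row in rows:
    print(row["order_id"], row["amount"])
```

Because each file in the question is a full snapshot going back to 2016, a Glue job could likely skip the merge and read only the newest file. The per-key merge shown here matters mainly when some files are partial, and note that it would retain a record the vendor dropped from a later file.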

Alex Skorokhod answered Mar 31 '26 04:03


