I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The Crawler does not alter data.
The Crawler is populating the AWS Glue Data Catalog with tables only. Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to do data cleaning using Athena/Glue before using data you need to follow the steps:
Map the data using Crawler into a temporary Athena database/table
Profile your data using Athena. SQL or QuickSight etc. to get the idea what you need to alter
Use Glue job to
Run one more Crawler to map cleaned data from the new S3 location into Athena database
The dedupe you are askinging about happens in step 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With