
AWS Glue Crawler sends all data to Glue Catalog and Athena without Glue Job

I am new to AWS Glue. I am using an AWS Glue crawler to crawl data from two S3 buckets, with one file in each bucket. The crawler creates two tables in the AWS Glue Data Catalog, and I am also able to query the data in AWS Athena.

My understanding was that, in order to get data into Athena, I needed to create a Glue job that would pull the data into Athena, but I was wrong. Is it correct to say that the Glue crawler places data in Athena without the need for a Glue job, and that we only need a Glue job if we want to push our data into a database such as SQL, Oracle, etc.?

How can I configure the Glue crawler so that it fetches only the delta data from the source bucket, rather than all the data every time?

Any help is appreciated.

asked Oct 13 '25 07:10 by Bokambo


1 Answer

The Glue crawler is only used to identify the schema of your data; it does not move or copy the data anywhere. Your data stays where it is (e.g. S3), and the crawler infers the schema by going through a sample of your files and records it as a table in the Glue Data Catalog.
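On the "delta" part of the question: since the crawler only scans files to infer schema and partitions, the closest setting is an incremental crawl, which limits each run to folders added since the last run. Below is a minimal sketch with boto3, not a definitive setup. The crawler name, role ARN, database name, and bucket path are all hypothetical placeholders, and the `LOG` schema change policy reflects my understanding of what incremental crawls require, so check it against the Glue documentation.

```python
def build_crawler_config(name, role_arn, database, s3_path):
    """Build keyword arguments for Glue's CreateCrawler API call.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Only crawl folders that were added since the last crawler run.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Assumption: incremental crawls expect schema changes to be logged
        # rather than applied to existing tables.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


def create_crawler(config):
    """Send the request to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the config can be built without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**config)


# Hypothetical names, for illustration only.
crawler_config = build_crawler_config(
    name="my-incremental-crawler",
    role_arn="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    database="my_crawled_db",
    s3_path="s3://my-source-bucket/data/",
)
```

Calling `create_crawler(crawler_config)` would register the crawler. Note that this works at the folder level, so it helps most when new data arrives in new prefixes (for example date-based partitions).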

You can then use a query engine like Athena (a managed, serverless engine based on Presto/Trino) to query the data in place, since the Data Catalog already holds its schema.
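To illustrate querying a crawled table without any Glue job, here is a rough sketch using boto3's Athena client. The database name, table name, and results bucket are hypothetical placeholders; substitute whatever the crawler actually created in your Data Catalog.

```python
def build_athena_query_request(database, table, output_location, limit=10):
    """Build keyword arguments for Athena's StartQueryExecution API call."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        # The database is the Glue Data Catalog database the crawler wrote to.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query to Athena (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the request can be built without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_query_request(
    database="my_crawled_db",
    table="bucket_one_file",
    output_location="s3://my-athena-results/queries/",
)
```

`run_query(request)` returns a query execution ID; the query runs asynchronously, so the results have to be fetched separately once it completes.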

If you want to process, clean, or aggregate the data, you can use Glue jobs, which are basically managed, serverless Spark.

answered Oct 14 '25 20:10 by Robert Kossendey


