The AWS Glue docs clearly state that a crawler scrapes metadata from the source (JDBC or S3) and populates the Data Catalog (creates/updates databases and the corresponding tables).
However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know there are no schema/partitioning changes.
So, is it required to run a crawler prior to running an ETL job in order to pick up new data?
No, you don't need to create a crawler to run a Glue job.
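To illustrate, here is a minimal sketch of a Glue (PySpark) job that reads straight from S3 without consulting the Data Catalog or a crawler at all; the bucket path and format are placeholders, not values from the question.

```python
# Sketch: a Glue PySpark job reading directly from S3, bypassing the Data Catalog.
# "s3://my-bucket/input/" and the JSON format are placeholder assumptions.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read every object under the prefix on each run; no crawler or catalog table needed.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"], "recurse": True},
    format="json",
)
print("Record count:", dyf.count())

job.commit()
```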
AWS Glue will automatically detect new data in S3 buckets as long as it lands within your existing folders (partitions). If data is added to new folders (partitions), you need to reload your partitions, e.g. with MSCK REPAIR TABLE mytable;.
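If you do read through catalog tables, one common way to run that statement is via Athena. Below is a hedged boto3 sketch; the database name, table name, region, and query-results bucket are placeholders. Note that MSCK REPAIR TABLE only discovers Hive-style key=value folder layouts.

```python
# Sketch: running MSCK REPAIR TABLE through Athena with boto3 so the Data Catalog
# picks up newly added partition folders. All names/paths below are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE mytable;",
    QueryExecutionContext={"Database": "mydatabase"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```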
You can use the simple graphical interface in AWS Glue Studio to manage your ETL jobs. Using the navigation menu, choose Jobs to view the Jobs page. On this page, you can see all the jobs that you have created either with AWS Glue Studio or the AWS Glue console. You can view, manage, and run your jobs on this page.
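Outside the console, the same jobs can also be started and monitored programmatically. A small boto3 sketch (the job name and region are placeholder assumptions) might look like this:

```python
# Sketch: starting a Glue job and polling it until it reaches a terminal state.
# "my-etl-job" is a placeholder job name.
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_job_run(JobName="my-etl-job")
run_id = run["JobRunId"]

# Poll the run state until the job finishes.
while True:
    state = glue.get_job_run(JobName="my-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Job finished with state:", state)
        break
    time.sleep(30)
```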
A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.
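For completeness, a crawler can also be defined and started from code rather than the console. The sketch below assumes a placeholder IAM role, database, and S3 path that would need to exist in your account.

```python
# Sketch: creating and starting a crawler with boto3.
# The role ARN, database name, and S3 path are placeholder assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="my-s3-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="mydatabase",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/input/"}]},
)

glue.start_crawler(Name="my-s3-crawler")
```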