
Is it required to run an AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating or updating the database and its corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects on S3, new rows in a DB table) if we know that there are no schema or partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order to pick up new data?

Yuriy Bondaruk asked Apr 11 '18 13:04


1 Answer

AWS Glue will automatically detect new data in S3 buckets as long as it lands within your existing folders (partitions), so you don't need to rerun the crawler in that case.

If data is added to new folders (partitions), you need to reload your partitions first, for example with MSCK REPAIR TABLE mytable;
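One way to issue that MSCK REPAIR TABLE statement from code is through Amazon Athena. The following is only a minimal sketch, not something from the original answer: it assumes a boto3 Athena client is passed in, and the helper name repair_partitions, the database name, and the S3 output location are hypothetical placeholders.

```python
def repair_partitions(athena_client, database, table, output_location):
    """Submit MSCK REPAIR TABLE for `table` and return the Athena query ID.

    `athena_client` is assumed to be a boto3 Athena client, e.g.
    boto3.client("athena"). `output_location` is an S3 URI where Athena
    writes query results (a placeholder you must supply).
    """
    # Guard against accidental injection: allow only simple identifiers.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")

    response = athena_client.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table}",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    # The query runs asynchronously; this ID can be used to poll its status.
    return response["QueryExecutionId"]


# Example call (requires AWS credentials; all names are placeholders):
#   import boto3
#   query_id = repair_partitions(
#       boto3.client("athena"), "mydb", "mytable", "s3://my-bucket/athena-results/"
#   )
```

The call only submits the query; the new partitions become visible once it finishes, so an ETL job that depends on them should wait for the query to succeed before starting.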

RobinL answered Oct 06 '22 23:10