The AWS Glue docs clearly state that a crawler scrapes metadata from the source (JDBC or S3) and populates the Data Catalog (creates/updates databases and the corresponding tables).
However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know there are no schema/partitioning changes.
So, is it required to run a crawler prior to running an ETL job in order to pick up new data?
No, you don't need to create a crawler to run a Glue job.
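To illustrate, here is a minimal sketch of a Glue (PySpark) job that reads straight from S3 without consulting the Data Catalog or a crawler at all; the bucket path and format are placeholders, not values from the question.

```python
# Sketch: a Glue PySpark job reading directly from S3, bypassing the Data Catalog.
# "s3://my-bucket/input/" and the JSON format are placeholder assumptions.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read every object under the prefix on each run; no crawler or catalog table needed.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"], "recurse": True},
    format="json",
)
print("Record count:", dyf.count())

job.commit()
```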
AWS Glue will automatically detect new data in S3 buckets as long as it lands within your existing folders (partitions). If data is added to new folders (partitions), you need to reload your partitions, e.g. with MSCK REPAIR TABLE mytable;.
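If you do read through catalog tables, one common way to run that statement is via Athena. Below is a hedged boto3 sketch; the database name, table name, region, and query-results bucket are placeholders. Note that MSCK REPAIR TABLE only discovers Hive-style key=value folder layouts.

```python
# Sketch: running MSCK REPAIR TABLE through Athena with boto3 so the Data Catalog
# picks up newly added partition folders. All names/paths below are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE mytable;",
    QueryExecutionContext={"Database": "mydatabase"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```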
You can use the simple graphical interface in AWS Glue Studio to manage your ETL jobs. Using the navigation menu, choose Jobs to view the Jobs page. On this page, you can see all the jobs that you have created either with AWS Glue Studio or the AWS Glue console. You can view, manage, and run your jobs on this page.
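Outside the console, the same jobs can also be started and monitored programmatically. A small boto3 sketch (the job name and region are placeholder assumptions) might look like this:

```python
# Sketch: starting a Glue job and polling it until it reaches a terminal state.
# "my-etl-job" is a placeholder job name.
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_job_run(JobName="my-etl-job")
run_id = run["JobRunId"]

# Poll the run state until the job finishes.
while True:
    state = glue.get_job_run(JobName="my-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Job finished with state:", state)
        break
    time.sleep(30)
```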
A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.
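For completeness, a crawler can also be defined and started from code rather than the console. The sketch below assumes a placeholder IAM role, database, and S3 path that would need to exist in your account.

```python
# Sketch: creating and starting a crawler with boto3.
# The role ARN, database name, and S3 path are placeholder assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="my-s3-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="mydatabase",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/input/"}]},
)

glue.start_crawler(Name="my-s3-crawler")
```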