What I understand from the AWS Glue docs is that a crawler helps crawl and discover new data. However, I noticed that after crawling once, if new data is added to S3, that data is already discovered when I query the Data Catalog, from Athena for example. So, can I say I do not need to run a crawler every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
No, you don't need to create a crawler to run a Glue job. A crawler can read multiple data sources and keep the Glue Data Catalog up to date.
You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.
What is a crawler? A crawler is a job defined in AWS Glue. It crawls databases and buckets in S3 and then creates tables in AWS Glue together with their schema. Then, you can perform your data operations in Glue, like ETL.
If data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (i.e. /day=3) in the Data Catalog to be able to query them via Athena. However, if the data is not partitioned, or arrives into already registered partitions, there is no need to run a crawler.
Alternatively to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE &lt;table&gt;, or by registering them manually.
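Both approaches can be run straight from the Athena query editor. A minimal sketch, assuming a hypothetical partitioned table named sales_data backed by s3://my-bucket/data/:

    -- Register one new partition by hand (table and bucket names
    -- are hypothetical; adjust to your own layout):
    ALTER TABLE sales_data ADD IF NOT EXISTS
      PARTITION (year = '2018', month = '11', day = '3')
      LOCATION 's3://my-bucket/data/year=2018/month=11/day=3/';

    -- Or let Athena scan the table's S3 location and register every
    -- missing Hive-style (key=value) partition in one pass:
    MSCK REPAIR TABLE sales_data;

Note that MSCK REPAIR TABLE scans the whole table location, so on buckets with many partitions an explicit ALTER TABLE ... ADD PARTITION for just the new folder is usually faster.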
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
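A minimal sketch of the manual route, assuming hypothetical comma-delimited files with two columns under the same s3://my-bucket/data/ prefix (running this in Athena registers the table in the Glue Data Catalog):

    -- Schema, delimiter, and location below are assumptions
    -- for illustration; adjust them to match your files.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
      order_id string,
      amount   double
    )
    PARTITIONED BY (year string, month string, day string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/data/'
    TBLPROPERTIES ('skip.header.line.count' = '1');

Keep in mind that a partitioned table created this way starts with no partitions registered, so you still need one of the registration steps above before the data is queryable.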