AWS Glue Crawler Creates Partition and File Tables

I have a pretty basic S3 setup that I would like to query using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...

I then set up an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I expected to get one database table, with partitions on the year, month, day, and hour.

What I get instead are tens of thousands of tables. There is a table for each file, and a table for each parent partition as well. As far as I can tell, separate tables were created for each file and folder, with no single overarching table that I can query across a large date range.

I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning such that I don't get this huge, mostly worthless dump of data.

asked Jun 29 '18 17:06

zachd1_618

1 Answer

Glue Crawler leaves a lot to be desired. It promises to solve a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json), it will more often than not mess up. It's especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which it looks like your data could be.

Depending on how much data you have, I might start by just creating an unpartitioned Athena table that points to the root of the structure. It's only once your data grows beyond multiple gigabytes or thousands of files that partitioning becomes important.
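As a rough illustration of the unpartitioned approach, the sketch below builds the kind of DDL statement you could paste into the Athena console. The database name, table name, and column list are hypothetical, since the question does not show the JSON schema, and the OpenX JSON SerDe is just one of the JSON SerDes Athena supports.

```python
def build_unpartitioned_ddl(database, table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement for JSON files under `location`.

    `columns` is a list of (name, athena_type) pairs. All names passed in
    below are placeholders, not taken from the original question.
    """
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical database, table, and columns; only the S3 path comes from the question.
ddl = build_unpartitioned_ddl(
    "my_database",
    "events",
    [("id", "string"), ("ts", "string")],
    "s3://bucket/data/",
)
print(ddl)
```

Because the table has no partitions, every query scans everything under the location, which is why this only makes sense while the data set is small.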

Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key and figure out which partition it belongs to and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API, but as long as your function catches it and ignores it you will be fine.
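A minimal sketch of such a function might look like the following. It assumes a table that already exists in the Glue catalog with four partition keys (year, month, day, hour, in that order); the database and table names are hypothetical, and the key pattern matches the data/YYYY/MM/DD/HH/ layout from the question. The partition's storage descriptor is copied from the table so that the format and SerDe settings match.

```python
import re
from urllib.parse import unquote_plus

DATABASE = "my_database"  # hypothetical Glue database name
TABLE = "events"          # hypothetical table with keys year, month, day, hour

# Matches keys such as data/2018/01/01/02/file3.json
KEY_PATTERN = re.compile(r"^data/(\d{4})/(\d{2})/(\d{2})/(\d{2})/[^/]+$")


def partition_from_key(key):
    """Return (partition values, S3 prefix) for a key, or None if it does not match."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    values = list(match.groups())
    return values, "data/" + "/".join(values) + "/"


def add_partition(glue, bucket, key):
    """Register the partition for `key`; return True only if it was newly added."""
    parsed = partition_from_key(key)
    if parsed is None:
        return False
    values, prefix = parsed
    # Reuse the table's storage descriptor, overriding only the location.
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    descriptor = dict(table["StorageDescriptor"])
    descriptor["Location"] = f"s3://{bucket}/{prefix}"
    try:
        glue.create_partition(
            DatabaseName=DATABASE,
            TableName=TABLE,
            PartitionInput={"Values": values, "StorageDescriptor": descriptor},
        )
    except glue.exceptions.AlreadyExistsException:
        # Another object in the same hour already registered this partition.
        return False
    return True


def handler(event, context):
    """Lambda entry point for S3 object-created notifications."""
    import boto3  # imported here so the helpers above can be tested without AWS

    glue = boto3.client("glue")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        add_partition(glue, bucket, key)
```

Calling get_table on every invocation keeps the sketch simple; with many objects per hour you may prefer to cache the descriptor or to skip keys whose partition was just added.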

answered Nov 15 '22 04:11

Theo

Tags:

amazon-web-services

amazon-s3

amazon-athena

aws-glue
