Using AWS Glue with Apache Avro on schema changes

Tags:

amazon-web-services

amazon-s3

avro

aws-glue

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, so I am struggling with the following use case:

We have an S3 bucket with a number of Avro files. We chose Avro because of its extensive support for schema changes over time, which allows new fields to be applied to old data with no problem.

As I understand it, an AWS Glue crawler creates a new table whenever it detects a schema change. Each time our schema changed, the crawler created another table, which is the expected behaviour but not what we want.

Ultimately, we would like the crawler to detect the most recent schema and apply it to all the data we are crawling in the S3 bucket, outputting only one table. We had (perhaps incorrectly) assumed that with Avro this would not be an issue: the crawler could fill the new schema fields with a given default or null value for older data (the benefit of using Avro) and output a single table that we could then query with Amazon Athena.
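To make the expectation concrete, here is a minimal sketch in plain Python (no Avro library involved) of the schema resolution behaviour we have in mind: a record written with an older schema is read with the newest schema, and any field it lacks takes that field's declared default. The `User` schema, its field names, and the `resolve_record` helper are all hypothetical and only illustrate the idea; this is not how Glue or an Avro library is implemented.

```python
# Hypothetical newest ("reader") schema: "email" was added later with a null default.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve_record(old_record, reader_schema):
    """Project a record written with an older schema onto the reader schema.

    Fields missing from the record take the reader schema's default;
    fields the reader schema does not know about are dropped.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in old_record:
            resolved[name] = old_record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# A row written before "email" existed.
old_row = {"id": 1, "name": "Ada"}
print(resolve_record(old_row, READER_SCHEMA))
```

With this behaviour, old and new files would all fit the same single table definition.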

Is there a way in AWS Glue to use a given schema for all data in the S3 bucket, enabling us to leverage the Avro benefit of schema evolution, so that all data is output into one table?


asked Feb 09 '18 20:02

CharStar

1 Answer

I haven't worked with Avro files specifically but AWS Glue lets you configure the crawler in several ways.

If you create a new crawler, you'll be prompted with a few options under the "Configure the crawler's output" section.

Based on your situation, I think you'll need to tick the box that says "Update all new and existing partitions with metadata from the table."

This is what that sub-menu looks like:

<img src="https://i.stack.imgur.com/Aj4ep.png" alt="glue-crawler">
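If you'd rather set this outside the console, my understanding is that the checkbox corresponds to the `AddOrUpdateBehavior: InheritFromTable` setting in the crawler's `Configuration` JSON. Below is a rough sketch using boto3; the crawler name, role ARN, database name, and S3 path are placeholders I made up, and I haven't tested it against Avro data, so treat it as a starting point rather than a verified recipe.

```python
import json


def build_crawler_configuration():
    """Return the crawler Configuration JSON string that makes partitions
    inherit their metadata (schema, format, etc.) from the table."""
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }
    return json.dumps(config)


def create_crawler(name, role_arn, database, s3_path):
    """Create a Glue crawler over one S3 path with the configuration above.

    All four arguments are placeholders supplied by the caller.
    """
    import boto3  # imported here so the config helper works without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Configuration=build_crawler_configuration(),
    )


# Inspect the JSON that would be sent to Glue.
print(build_crawler_configuration())
```

Calling `create_crawler(...)` needs AWS credentials and an IAM role that Glue can assume; printing the configuration does not.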


answered Oct 18 '22 13:10

David Gasquez
