I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case:
We have an s3 bucket with a number of Avro files. We have decided to use Avro due to having extensive support for data schema changes overtime, allowing new fields to be applied to old data with no problem.
With AWS Glue, I understand that a new table is created by a crawler whenever there is a schema change. When our schema has changed, this has caused a number of new tables to be created by the crawler, as expected, but not quite as we desire...
Ultimately, we would like the crawler to detect the most recent schema and apply this schema to all the data that we are crawling in the s3 bucket, outputting only one table. We had (perhaps incorrectly) assumed that by using Avro, this would not be an issue as the crawler could apply new schema fields with a given default or null value to older data (the benefit of using Avro), and only output one table that we then could query using AWS Athena.
Is there a way in AWS Glue to use a given schema for all data in the s3 bucket, enabling us to leverage the Avro benefit of schema evolution, so that all data is output into one table?
To change the schema of a table, choose Edit schema to add and remove columns, change column names, and change data types. To compare different versions of a table, including its schema, choose Compare versions to see a side-by-side comparison of two versions of the schema for a table.
Classifier. Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others. It also provides classifiers for common relational database management systems using a JDBC connection.
You can then use this Avro schema, for example, to serialize a Java object (POJO) into bytes, and deserialize these bytes back into the Java object. Avro not only requires a schema during data serialization, but also during data deserialization.
I haven't worked with Avro files specifically but AWS Glue lets you configure the crawler in several ways.
If you create a new crawler, you'll be prompted with a few options under the "Configure the crawler's output" section.
Based on your situation, I think you'll need to tick the box that says Update all new and existing partitions with metadata from the table.
This is how that sub-menu looks like.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With