AWS Glue: How to handle nested JSON with varying schemas

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.

Background: The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.

Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.
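As a rough illustration of what splitting at the first level would buy in terms of size, here is a minimal plain-Python sketch (no Glue involved). It serializes each top-level element separately and compares its UTF-8 byte length with the 65535-byte VARCHAR limit. The names `column_sizes` and `VARCHAR_MAX_BYTES` and the sample record are hypothetical, made up purely for illustration.

```python
import json

VARCHAR_MAX_BYTES = 65535  # Redshift's maximum VARCHAR length, in bytes


def column_sizes(record):
    """Map each top-level key to the UTF-8 byte length of its JSON encoding."""
    return {
        key: len(json.dumps(value).encode("utf-8"))
        for key, value in record.items()
    }


# Invented record shaped loosely like a DynamoDB Streams entry
record = {
    "EventName": "INSERT",
    "Keys": {"id": {"S": "abc"}},
    "NewImage": {"id": {"S": "abc"}, "payload": {"S": "x" * 70000}},
}

sizes = column_sizes(record)
too_big = [key for key, size in sizes.items() if size > VARCHAR_MAX_BYTES]
print(too_big)  # ['NewImage']
```

A check like this only shows which top-level elements would still exceed the limit on their own; it does not make an oversized element fit.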

What we've tried/referenced so far:

1. Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top-level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load them into a single table.
2. Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). Setting columns to STRUCT requires a defined schema, but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
3. The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple of weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appears to help with handling the hundreds of tables created by the Glue Crawler.

Question: How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?

I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.

Thanks!

asked Mar 23 '18 21:03

ehelander

1 Answer

I'm not sure you can do this with a table definition, but you can accomplish it with an ETL job by using a mapping function that serializes the top-level values to JSON strings. Documentation: [link]

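import json

from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Mapping function: serialize every top-level value to a JSON string
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

# Read the raw JSON records from S3
old_df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={"paths": ['s3://...']},
    format="json")

# Apply the mapping function to every DynamicRecord in the DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)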

From here you can export to S3 (perhaps as Parquet or another columnar format to optimize for querying) or, as far as I understand, load directly into Redshift, although I haven't tried that.
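One caveat with a mapping function that calls json.dumps on every value: plain strings get wrapped in an extra pair of quotes, so an EventName of INSERT would be stored as "INSERT" with the quotes included. A minimal sketch of a variant that serializes only nested values (dicts and lists) and leaves scalars untouched might look like this. It is plain Python so it can be tried outside Glue; the name `stringify_nested` and the sample record are hypothetical, and whether the record that Map.apply passes in behaves like a plain dict in your Glue version is an assumption worth verifying.

```python
import json


def stringify_nested(rec):
    """Return a copy of rec with dict/list values serialized to JSON strings."""
    out = {}
    for key, value in rec.items():
        if isinstance(value, (dict, list)):
            # Nested structure: keep it intact, but as a single string column
            out[key] = json.dumps(value, sort_keys=True)
        else:
            # Scalars (str, int, float, bool, None) pass through unchanged
            out[key] = value
    return out


# Invented record shaped loosely like a DynamoDB Streams entry
sample = {
    "EventName": "MODIFY",
    "SizeBytes": 59,
    "Keys": {"id": {"S": "abc"}},
    "NewImage": {"id": {"S": "abc"}, "tags": {"L": [{"S": "x"}]}},
}

flat = stringify_nested(sample)
print(flat["EventName"])  # MODIFY
print(flat["Keys"])       # {"id": {"S": "abc"}}
```

Returning a new dict rather than mutating the input keeps the function easy to test on its own; if it works with your records, it could be passed as f to Map.apply in place of flatten.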

answered Oct 21 '22 09:10

x1084

Tags: amazon-redshift, aws-glue, amazon-dynamodb-streams, amazon-redshift-spectrum
