Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading JSON data to AWS Redshift results in NULL values

I am trying to perform a load/copy operation to import data from JSON files in an S3 bucket directly to Redshift. The COPY operation succeeds, and after the COPY, the table has the correct number of rows/records, but every record is NULL !

It takes the expected amount of time for the load, the COPY command returns OK, the Redshift console reports successful and no errors... but if I perform a simple query from the table, it returns only NULL values.

The JSON is very simple + flat, and formatted correctly (according to examples I found here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html)

Basically, it is one row per line, formatted like:

{ "col1": "val1", "col2": "val2", ... }
{ "col1": "val1", "col2": "val2", ... }
{ "col1": "val1", "col2": "val2", ... }

I have tried things like rewriting the schema based on values and data types found in the JSON objects and also copying from uncompressed files. I thought perhaps the JSON was not being parsed correctly upon load, but it should presumably raise an error if the objects cannot be parsed.

My COPY command looks like this:

copy events from 's3://mybucket/json/prefix' 
with credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
json 'auto' gzip;

Any guidance would be appreciated! Thanks.

like image 675
shane Avatar asked Jun 30 '15 01:06

shane


People also ask

How do you handle null in Redshift?

If you expect a query to return null values for certain functions or columns, you can use an NVL expression to replace the nulls with some other value. For example, aggregate functions, such as SUM, return null values instead of zeroes when they have no rows to evaluate.

Does Redshift support JSON data?

Though Amazon Redshift supports JSON functions over CHAR and VARCHAR columns, we recommend using SUPER for processing data in JSON serialization format.


1 Answers

So I have discovered the cause - This would not have been evident from the description I provided in my original post.

When you create a table in Redshift, the column names are converted to lowercase. When you perform a COPY operation, the column names are case sensitive.

The input data that I have been trying to load is using camelCase for column names, and so when I perform the COPY, the columns do not match up with the defined schema (which now uses all lowercase column names)

The operation does not raise an error, though. It just leaves NULLs in all the columns that did not match (in this case, all of them)

Hope this helps somebody to avoid the same confusion!

like image 167
shane Avatar answered Sep 21 '22 09:09

shane