
How to define nested array to ingest data and convert?

I am using Firehose and Glue to ingest data and convert JSON to Parquet files in S3.

This works for flat JSON (no nesting or arrays), but it fails for a nested JSON array. Here is what I have done:

the JSON structure

{
    "class_id": "test0001",
    "students": [{
        "student_id": "xxxx",
        "student_name": "AAAABBBCCC",
        "student_gpa": 123
    }]
}

the Glue schema

  1. class_id : string
  2. students : array ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>

I receive this error:

The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>' but 'ARRAY' is found.

Any suggestion is appreciated.

franco phong asked Nov 08 '19 14:11


1 Answer

I ran into the same error because I created my schemas manually in the AWS console. The help text shown next to the form for entering nested data capitalizes every type name, but Parquet conversion only works with lowercase type definitions.

So, despite the example AWS gives, write the type in lowercase:

array<struct<student_id:string,student_name:string,student_gpa:int>>
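
If the schema is generated by a script rather than typed into the console, the same fix can be applied automatically. The following is only a minimal sketch: the helper name `lowercase_glue_type` and the keyword list are hypothetical, not part of any AWS SDK, and it assumes that struct field names are always followed by a colon so they can be left untouched.

```python
import re

# Hypothetical helper (not an AWS API): lowercases type keywords in a
# Hive-style type string. A keyword directly followed by ":" is treated
# as a field name and left unchanged. The keyword list is an assumption
# and may need extending for other types.
_TYPE_KEYWORDS = re.compile(
    r"\b(array|struct|map|string|int|bigint|double|float|boolean|timestamp|date)\b(?!\s*:)",
    re.IGNORECASE,
)


def lowercase_glue_type(type_string: str) -> str:
    """Return type_string with its type keywords converted to lowercase."""
    return _TYPE_KEYWORDS.sub(lambda m: m.group(0).lower(), type_string)


# Column definitions for the schema from the question, as Name/Type pairs.
columns = [
    {"Name": "class_id", "Type": lowercase_glue_type("STRING")},
    {
        "Name": "students",
        "Type": lowercase_glue_type(
            "ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>"
        ),
    },
]

for column in columns:
    print(column["Name"], ":", column["Type"])
```

The resulting `Type` strings can then be pasted into the console or passed to whatever tool creates the table.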
ben answered Oct 16 '22 19:10