I have defined the schema for my DataFrame in a JSON file as follows:
{
  "table1": {
    "fields": [
      {"metadata": {}, "name": "first_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "last_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "subjects", "type": "array", "items": {"type": ["string", "string"]}, "nullable": false},
      {"metadata": {}, "name": "marks", "type": "array", "items": {"type": ["integer", "integer"]}, "nullable": false},
      {"metadata": {}, "name": "dept", "type": "string", "nullable": false}
    ]
  }
}
Example JSON data:
{
"table1": [
{
"first_name":"john",
"last_name":"doe",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"dan",
"last_name":"steyn",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"rose",
"last_name":"wayne",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"nat",
"last_name":"lee",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"jim",
"last_name":"lim",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
}
]
}
I want to create the equivalent Spark schema from this JSON file. Below is my code (reference: Create spark dataframe schema from json schema representation):
import json
from pyspark.sql.types import StructType

# Load the "table1" schema definition and convert it to a Spark StructType
with open(schemaFile) as s:
    schema = json.load(s)["table1"]

source_schema = StructType.fromJson(schema)
The above code works fine if I don't have any array columns, but it throws the error below when there are array columns in my schema:
Could not parse datatype: array
In your case the issue is with the representation of the array columns. The correct syntax is:
{ "metadata": {},
"name": "marks",
"nullable": true, "type": {"containsNull": true, "elementType": "long", "type": "array" } }
.
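For example, a corrected version of your schema file could look like the sketch below (the containsNull values and the choice of "long" for marks are assumptions based on the inferred schema shown further down); it then loads cleanly with StructType.fromJson:
import json
from pyspark.sql.types import StructType

# Hypothetical corrected schema file content: array columns use the nested
# {"type": "array", "elementType": ..., "containsNull": ...} representation
corrected_schema = """
{
  "table1": {
    "fields": [
      {"metadata": {}, "name": "first_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "last_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "subjects", "nullable": false,
       "type": {"type": "array", "elementType": "string", "containsNull": true}},
      {"metadata": {}, "name": "marks", "nullable": false,
       "type": {"type": "array", "elementType": "long", "containsNull": true}},
      {"metadata": {}, "name": "dept", "type": "string", "nullable": false}
    ]
  }
}
"""

source_schema = StructType.fromJson(json.loads(corrected_schema)["table1"])
print(source_schema.simpleString())
# struct<first_name:string,last_name:string,subjects:array<string>,marks:array<bigint>,dept:string>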
In order to retrieve the schema from the JSON data itself, you can use the following PySpark snippet:
jsonData = """{
"table1": [{
"first_name": "john",
"last_name": "doe",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "dan",
"last_name": "steyn",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "rose",
"last_name": "wayne",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "nat",
"last_name": "lee",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "jim",
"last_name": "lim",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
}
]
}"""
# Let Spark infer the schema from the sample JSON, then dump it as a JSON string
df = spark.read.json(sc.parallelize([jsonData]))
df.schema.json()
This should output:
{
  "fields": [
    {
      "metadata": {},
      "name": "table1",
      "nullable": true,
      "type": {
        "containsNull": true,
        "elementType": {
          "fields": [
            {"metadata": {}, "name": "dept", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "first_name", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "last_name", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "marks", "nullable": true,
             "type": {"containsNull": true, "elementType": "long", "type": "array"}},
            {"metadata": {}, "name": "subjects", "nullable": true,
             "type": {"containsNull": true, "elementType": "string", "type": "array"}}
          ],
          "type": "struct"
        },
        "type": "array"
      }
    }
  ],
  "type": "struct"
}
Alternatively, you could use df.schema.simpleString(), which returns a relatively simpler schema format:
struct<table1:array<struct<dept:string,first_name:string,last_name:string,marks:array<bigint>,subjects:array<string>>>>
Finally, you can store the JSON schema in a file and load it later with:
import json
from pyspark.sql.types import StructType

new_schema = StructType.fromJson(json.loads(schema_json))
This is exactly what you were doing already. Note that you can apply this process dynamically to any JSON data.
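Putting it together, an end-to-end sketch (the file name schema.json is just an example): persist the inferred schema, reload it, and reuse it when reading new data.
import json
from pyspark.sql.types import StructType

# Persist the schema inferred above (file name "schema.json" is just an example)
with open("schema.json", "w") as f:
    f.write(df.schema.json())

# Later: reload the schema and reuse it when reading new JSON data
with open("schema.json") as f:
    new_schema = StructType.fromJson(json.load(f))

new_df = spark.read.schema(new_schema).json(sc.parallelize([jsonData]))
new_df.printSchema()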