I have defined the schema for my DataFrame in a JSON file as follows:
{
  "table1": {
    "fields": [
      {"metadata": {}, "name": "first_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "last_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "subjects", "type": "array", "items": {"type": ["string", "string"]}, "nullable": false},
      {"metadata": {}, "name": "marks", "type": "array", "items": {"type": ["integer", "integer"]}, "nullable": false},
      {"metadata": {}, "name": "dept", "type": "string", "nullable": false}
    ]
  }
}
Example JSON data:
{
"table1": [
{
"first_name":"john",
"last_name":"doe",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"dan",
"last_name":"steyn",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"rose",
"last_name":"wayne",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"nat",
"last_name":"lee",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
},
{
"first_name":"jim",
"last_name":"lim",
"subjects":["maths","science"],
"marks":[90,67],
"dept":"abc"
}
]
}
I want to create the equivalent Spark schema from this JSON file. Below is my code (reference: Create spark dataframe schema from json schema representation):
import json
from pyspark.sql.types import StructType

# Load the "table1" schema definition and convert it to a Spark StructType
with open(schemaFile) as s:
    schema = json.load(s)["table1"]

source_schema = StructType.fromJson(schema)
The above code works fine if I don't have any array columns, but it throws the error below when there are array columns in my schema:
Could not parse datatype: array
In your case the issue is with the representation of the array columns. The correct syntax is:
{ "metadata": {},
"name": "marks",
"nullable": true, "type": {"containsNull": true, "elementType": "long", "type": "array" } }
.
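For example, a corrected version of your schema file could look like the sketch below (the containsNull values and the choice of "long" for marks are assumptions based on the inferred schema shown further down); it then loads cleanly with StructType.fromJson:
import json
from pyspark.sql.types import StructType

# Hypothetical corrected schema file content: array columns use the nested
# {"type": "array", "elementType": ..., "containsNull": ...} representation
corrected_schema = """
{
  "table1": {
    "fields": [
      {"metadata": {}, "name": "first_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "last_name", "type": "string", "nullable": false},
      {"metadata": {}, "name": "subjects", "nullable": false,
       "type": {"type": "array", "elementType": "string", "containsNull": true}},
      {"metadata": {}, "name": "marks", "nullable": false,
       "type": {"type": "array", "elementType": "long", "containsNull": true}},
      {"metadata": {}, "name": "dept", "type": "string", "nullable": false}
    ]
  }
}
"""

source_schema = StructType.fromJson(json.loads(corrected_schema)["table1"])
print(source_schema.simpleString())
# struct<first_name:string,last_name:string,subjects:array<string>,marks:array<bigint>,dept:string>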
In order to retrieve the schema from the JSON data itself, you can use the following PySpark snippet:
jsonData = """{
"table1": [{
"first_name": "john",
"last_name": "doe",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "dan",
"last_name": "steyn",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "rose",
"last_name": "wayne",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "nat",
"last_name": "lee",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
},
{
"first_name": "jim",
"last_name": "lim",
"subjects": ["maths", "science"],
"marks": [90, 67],
"dept": "abc"
}
]
}"""
# Let Spark infer the schema from the sample JSON, then dump it as a JSON string
df = spark.read.json(sc.parallelize([jsonData]))
df.schema.json()
This should output:
{
  "fields": [
    {
      "metadata": {},
      "name": "table1",
      "nullable": true,
      "type": {
        "containsNull": true,
        "elementType": {
          "fields": [
            {"metadata": {}, "name": "dept", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "first_name", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "last_name", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "marks", "nullable": true,
             "type": {"containsNull": true, "elementType": "long", "type": "array"}},
            {"metadata": {}, "name": "subjects", "nullable": true,
             "type": {"containsNull": true, "elementType": "string", "type": "array"}}
          ],
          "type": "struct"
        },
        "type": "array"
      }
    }
  ],
  "type": "struct"
}
Alternatively, you could use df.schema.simpleString(), which returns a relatively simpler schema format:
struct<table1:array<struct<dept:string,first_name:string,last_name:string,marks:array<bigint>,subjects:array<string>>>>
Finally, you can store the JSON schema in a file and load it later with:
import json
from pyspark.sql.types import StructType

new_schema = StructType.fromJson(json.loads(schema_json))
This is exactly what you were doing already. Note that you can apply this process dynamically to any JSON data.
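Putting it together, an end-to-end sketch (the file name schema.json is just an example): persist the inferred schema, reload it, and reuse it when reading new data.
import json
from pyspark.sql.types import StructType

# Persist the schema inferred above (file name "schema.json" is just an example)
with open("schema.json", "w") as f:
    f.write(df.schema.json())

# Later: reload the schema and reuse it when reading new JSON data
with open("schema.json") as f:
    new_schema = StructType.fromJson(json.load(f))

new_df = spark.read.schema(new_schema).json(sc.parallelize([jsonData]))
new_df.printSchema()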