Spark: 3.0.0
Scala: 2.12.8
My data frame has a column with JSON string, and I want to create a new column from it with the StructType.
| temp_json_string |
|---|
| {"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""} |
root
|-- temp_json_string: string (nullable = true)
Formatted JSON:
{
"name":"test",
"id":"12",
"category":[
{
"products":[
"A",
"B"
],
"displayName":"test_1",
"displayLabel":"test1"
},
{
"products":[
"C"
],
"displayName":"test_2",
"displayLabel":"test2"
}
],
"createdAt":"",
"createdBy":""
}
I want to create a new column of type Struct so I tried:
dataFrame
.withColumn("temp_json_struct", struct(col("temp_json_string")))
.select("temp_json_struct")
Now, I get the schema as:
root
|-- temp_json_struct: struct (nullable = false)
| |-- temp_json_string: string (nullable = true)
Desired result:
root
|-- temp_json_struct: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- category: array (nullable = true)
| | |-- products: array (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- displayLabel: string (nullable = true)
| |-- createdAt: timestamp (nullable = true)
| |-- updatedAt: timestamp (nullable = true)
json_str_col is the column that has JSON string. I had multiple files so that's why the fist line is iterating through each row to extract the schema. If you know your schema up front then just replace json_schema with that.
json_schema = spark.read.json(df.rdd.map(lambda row: row.json_str_col)).schema
df = df.withColumn('new_col', from_json(col('json_str_col'), json_schema))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With