I have a data set that comes in as XML, and one of the nodes contains JSON. Spark is reading this in as a StringType, so I am trying to use from_json() to convert the JSON to a DataFrame.
I am able to convert a string of JSON, but how do I write the schema to work with an Array?
String without Array - Working nicely
import org.apache.spark.sql.functions._
val schemaExample = new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")
val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))
dfICanWorkWith.collect()
// Results \\
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])
String with an Array - Can't figure this one out
import org.apache.spark.sql.functions._
val schemaExample2 = new StructType()
.add("", ArrayType(new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
)
)
val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")
val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))
dfICanWorkWith.collect()
// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])
Spark ArrayType is a collection data type that extends the DataType class which is a superclass of all types in Spark. All elements of ArrayType should have the same type of elements.
StructType – Defines the structure of the Dataframe Spark provides spark. sql. types. StructType class to define the structure of the DataFrame and It is a collection or list on StructField objects. By calling Spark DataFrame printSchema() print the schema on console where StructType columns are represented as struct .
Create PySpark ArrayType You can create an instance of an ArrayType using ArraType() class, This takes arguments valueType and one optional argument valueContainsNull to specify if a value can accept null, by default it takes True. valueType should be a PySpark type that extends DataType class.
from_json (col, schema, options={})[source] Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null , in the case of an unparseable string.
The problem is that you don't have a fully qualified json. Your json is missing a couple of things:
Try replacing it with:
val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")
and you will get:
scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
as of spark 2.4 the schema_of_json
function helps:
> SELECT schema_of_json('[{"col":0}]');
array<struct<col:int>>
in your case you can then use the below code to parse that array of son objects:
scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>' ) as theJson""").show(false)
+------------------------------+
|theJson |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With