Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark from_json - StructType and ArrayType

I have a data set that comes in as XML, and one of the nodes contains JSON. Spark is reading this in as a StringType, so I am trying to use from_json() to convert the JSON to a DataFrame.

I am able to convert a string of JSON, but how do I write the schema to work with an Array?

String without Array - Working nicely

import org.apache.spark.sql.functions._

val schemaExample = new StructType()
          .add("FirstName", StringType)
          .add("Surname", StringType)

val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")

val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))

dfICanWorkWith.collect()

// Results \\
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])

String with an Array - Can't figure this one out

import org.apache.spark.sql.functions._

val schemaExample2 = new StructType()
                              .add("", ArrayType(new StructType()
                                                          .add("FirstName", StringType)
                                                          .add("Surname", StringType)
                                                )
                                  )

val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")

val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))

dfICanWorkWith.collect()

// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])
like image 378
Dragick Avatar asked Jul 10 '17 03:07

Dragick


People also ask

What is ArrayType in spark?

Spark ArrayType is a collection data type that extends the DataType class which is a superclass of all types in Spark. All elements of ArrayType should have the same type of elements.

What is spark StructType?

StructType – Defines the structure of the Dataframe Spark provides spark. sql. types. StructType class to define the structure of the DataFrame and It is a collection or list on StructField objects. By calling Spark DataFrame printSchema() print the schema on console where StructType columns are represented as struct .

How does PySpark define ArrayType?

Create PySpark ArrayType You can create an instance of an ArrayType using ArraType() class, This takes arguments valueType and one optional argument valueContainsNull to specify if a value can accept null, by default it takes True. valueType should be a PySpark type that extends DataType class.

What is From_json in PySpark?

from_json (col, schema, options={})[source] Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null , in the case of an unparseable string.


2 Answers

The problem is that you don't have a fully qualified json. Your json is missing a couple of things:

  • First you are missing the surrounding {} in which the json is done
  • Second you are missing the variable value (you set it as "" but did not add it)
  • Lastly you are missing the closing ]

Try replacing it with:

val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")

and you will get:

scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
like image 66
Assaf Mendelson Avatar answered Oct 20 '22 22:10

Assaf Mendelson


as of spark 2.4 the schema_of_json function helps:

> SELECT schema_of_json('[{"col":0}]');
  array<struct<col:int>>

in your case you can then use the below code to parse that array of son objects:

scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>' ) as theJson""").show(false)
+------------------------------+
|theJson                       |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+
like image 26
parisni Avatar answered Oct 20 '22 21:10

parisni