Let's say I have a dataframe which looks like this:
+---+-----+--------------------------------------------------------------+
|id |Name |Payment                                                       |
+---+-----+--------------------------------------------------------------+
|1  |James|[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|
+---+-----+--------------------------------------------------------------+
And the schema is:
root
|-- id: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Payment: string (nullable = true)
How can I explode the above JSON array into below:
+---+-----+---------------------------+
|id |Name |Payment                    |
+---+-----+---------------------------+
|1  |James|{"@id":1, "currency":"GBP"}|
+---+-----+---------------------------+
|1  |James|{"@id":2, "currency":"USD"}|
+---+-----+---------------------------+
I've been trying to use the explode functionality like the below, but it's not working. It's giving an error about not being able to explode string types, and that it expects either a map or array. This makes sense given the schema denotes it's a string, rather than an array/map, but I'm not sure how to convert this into an appropriate format.
val newDF = dataframe.withColumn("nestedPayment", explode(dataframe.col("Payment")))
Any help is greatly appreciated!
Spark's explode(e: Column) turns an array or map column into rows: it returns a new row for each element of the array (or each key/value pair of the map). Unless you give it an alias, the generated column is named col for array elements and key/value for map entries. Crucially, explode does not accept string columns, which is exactly why the error above complains that Payment is a string rather than an array or map.
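As a quick illustration (a minimal sketch with made-up data, assuming a SparkSession named spark is in scope), explode behaves as expected once the column really is an array type:

import org.apache.spark.sql.functions._
import spark.implicits._

// explode flattens a real ArrayType column into one row per element.
val arrayDF = Seq((1, "James", Seq("GBP", "USD"))).toDF("id", "Name", "currencies")
arrayDF.select($"id", $"Name", explode($"currencies").as("currency")).show()
// +---+-----+--------+
// | id| Name|currency|
// +---+-----+--------+
// |  1|James|     GBP|
// |  1|James|     USD|
// +---+-----+--------+

The question's Payment column is a plain string holding JSON text, so it first has to be parsed into an array before explode can be applied.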
My solution is to wrap the JSON array string in an outer JSON object so it can be parsed with from_json, using a struct type that contains an array of strings:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val dataframe = spark.sparkContext.parallelize(Seq(("1", "James", "[ {\"@id\": 1, \"currency\":\"GBP\"},{\"@id\": 2, \"currency\": \"USD\"} ]"))).toDF("id", "Name", "Payment")

// Wrap the array in an object, parse {"array": [...]}, then explode the parsed array of JSON strings.
val result = dataframe
  .withColumn("wrapped_json", concat_ws("", lit("{\"array\":"), col("Payment"), lit("}")))
  .withColumn("array_json", from_json(col("wrapped_json"), StructType(Seq(StructField("array", ArrayType(StringType))))))
  .withColumn("result", explode(col("array_json.array")))
Result:
+---+-----+--------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------+--------------------------+
|id |Name |Payment |wrapped_json |array_json |result |
+---+-----+--------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------+--------------------------+
|1 |James|[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|{"array":[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]}|[[{"@id":1,"currency":"GBP"}, {"@id":2,"currency":"USD"}]]|{"@id":1,"currency":"GBP"}|
|1 |James|[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|{"array":[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]}|[[{"@id":1,"currency":"GBP"}, {"@id":2,"currency":"USD"}]]|{"@id":2,"currency":"USD"}|
+---+-----+--------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------+--------------------------+
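If you only want the shape shown in the question, you can then drop the helper columns (a small follow-up sketch on top of the result above; the exploded name is mine):

// Keep only the original columns plus the exploded JSON string, renamed back to Payment.
val exploded = result
  .drop("Payment", "wrapped_json", "array_json")
  .withColumnRenamed("result", "Payment")
exploded.show(false)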
I am using Spark 2.3.2 and Kudakwashe Nyatsanza's solution did not work for me; it threw org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(value)' due to data type mismatch: Input schema array<string> must be a struct or an array of structs.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val newDF = dataframe.withColumn("Payment",
  explode(
    from_json(
      get_json_object($"Payment", "$."), ArrayType(StringType)
    )
  )
)
You'll have to parse the JSON string into an array of JSON strings, and then use explode on the result (explode expects an array). To do that (assuming Spark 2.0.*):

If you know all Payment values contain a JSON array of the same size (e.g. 2 in this case), you can hard-code extraction of the first and second elements, wrap them in an array and explode:
val newDF = dataframe.withColumn("Payment", explode(array(
get_json_object($"Payment", "$[0]"),
get_json_object($"Payment", "$[1]")
)))
If you can't guarantee all records have a JSON with a 2-element array, but you can guarantee a maximum length of these arrays, you can use this trick to parse elements up to the maximum size and then filter out the resulting nulls:
val maxJsonParts = 3 // whatever that number is...
val jsonElements = (0 until maxJsonParts)
  .map(i => get_json_object($"Payment", s"$$[$i]"))

val newDF = dataframe
  .withColumn("Payment", explode(array(jsonElements: _*)))
  .where(!isnull($"Payment"))
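If you also need the individual fields rather than the raw JSON strings, one way (a sketch, assuming Spark 2.1+ where from_json is available, with a schema matching the sample payload) is to parse each exploded element into a struct:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Assumed schema for each payment object; adjust the fields to your real payload.
val paymentSchema = StructType(Seq(
  StructField("@id", IntegerType),
  StructField("currency", StringType)
))

// Parse each exploded JSON string into a struct and pull out its fields.
val withFields = newDF
  .withColumn("parsed", from_json($"Payment", paymentSchema))
  .select($"id", $"Name",
    $"parsed".getField("@id").as("paymentId"),
    $"parsed".getField("currency").as("currency"))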