DataFrame explode list of JSON objects

I have JSON data in the following format:

{
    "date": 100,
    "userId": 1,
    "data": [
        {
            "timeStamp": 101,
            "reading": 1
        },
        {
            "timeStamp": 102,
            "reading": 2
        }
    ]
}
{
    "date": 200,
    "userId": 1,
    "data": [
        {
            "timeStamp": 201,
            "reading": 3
        },
        {
            "timeStamp": 202,
            "reading": 4
        }
    ]
}

I read it into Spark SQL:

val df = sqlContext.read.json(...)
df.printSchema
// root
//  |-- data: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- reading: double (nullable = true)
//  |    |    |-- timeStamp: double (nullable = true)
//  |-- date: double (nullable = true)
//  |-- userId: long (nullable = true)
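
Aside: by default Spark's JSON reader expects one complete JSON object per line (JSON Lines), so records pretty-printed as above must be compacted to one line each; Spark 2.2+ can also read them as-is with the multiLine option. A minimal sketch, with a hypothetical path readings.json:

// one compacted record per line, e.g.
// {"date": 100, "userId": 1, "data": [{"timeStamp": 101, "reading": 1}, {"timeStamp": 102, "reading": 2}]}
val df = sqlContext.read.json("readings.json")

// Spark 2.2+ alternative for multi-line (pretty-printed) JSON:
// val df = spark.read.option("multiLine", true).json("readings.json")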

I would like to transform it in order to have one row per reading. To my understanding, every transformation should produce a new DataFrame, so the following should work:

import org.apache.spark.sql.functions.explode
val exploded = df
    .withColumn("reading", explode(df("data.reading")))
    .withColumn("timeStamp", explode(df("data.timeStamp")))
    .drop("data")
exploded.printSchema
// root
//  |-- date: double (nullable = true)
//  |-- userId: long (nullable = true)
//  |-- reading: double (nullable = true)
//  |-- timeStamp: double (nullable = true)

The resulting schema is correct, but I get every value twice:

exploded.show
// +-----------+-----------+-----------+-----------+
// |       date|     userId|  timeStamp|    reading|
// +-----------+-----------+-----------+-----------+
// |        100|          1|        101|          1|
// |        100|          1|        101|          1|
// |        100|          1|        102|          2|
// |        100|          1|        102|          2|
// |        200|          1|        201|          3|
// |        200|          1|        201|          3|
// |        200|          1|        202|          4|
// |        200|          1|        202|          4|
// +-----------+-----------+-----------+-----------+

My feeling is that there is something about the lazy evaluation of the two explodes that I don't understand.

Is there a way to get the above code to work? Or should I use a different approach altogether?

asked Jan 28 '16 by Sune Andreas Dybro Debel



1 Answer

The resulting schema is correct, but I get every value twice

While the schema is correct, the output you've provided doesn't reflect the actual result. In practice you'll get the Cartesian product of timeStamp and reading for each input row.

My feeling is that there is something about the lazy evaluation

No, it has nothing to do with lazy evaluation. The way you use explode is just wrong. To understand what is going on, let's trace the execution for date equal to 100:

val df100 = df.where($"date" === 100)

step by step. The first explode generates two rows, one for reading 1 and one for reading 2:

val df100WithReading = df100.withColumn("reading", explode(df("data.reading")))

df100WithReading.show
// +------------------+----+------+-------+
// |              data|date|userId|reading|
// +------------------+----+------+-------+
// |[[1,101], [2,102]]| 100|     1|      1|
// |[[1,101], [2,102]]| 100|     1|      2|
// +------------------+----+------+-------+

The second explode generates two rows (timeStamp equal to 101 and 102) for each row from the previous step:

val df100WithReadingAndTs = df100WithReading
  .withColumn("timeStamp", explode(df("data.timeStamp")))

df100WithReadingAndTs.show
// +------------------+----+------+-------+---------+
// |              data|date|userId|reading|timeStamp|
// +------------------+----+------+-------+---------+
// |[[1,101], [2,102]]| 100|     1|      1|      101|
// |[[1,101], [2,102]]| 100|     1|      1|      102|
// |[[1,101], [2,102]]| 100|     1|      2|      101|
// |[[1,101], [2,102]]| 100|     1|      2|      102|
// +------------------+----+------+-------+---------+
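
The row count after both explodes is therefore the product of the two array lengths. A quick sanity check on the df100 subset defined above:

// 2 readings × 2 timestamps = 4 rows, only 2 of which pair up correctly
df100WithReadingAndTs.count
// res: Long = 4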

If you want correct results, explode data and select afterwards:

val exploded = df.withColumn("data", explode($"data"))
  .select($"userId", $"date",
    $"data".getItem("reading"), $"data".getItem("timeStamp"))

exploded.show
// +------+----+-------------+---------------+
// |userId|date|data[reading]|data[timeStamp]|
// +------+----+-------------+---------------+
// |     1| 100|            1|            101|
// |     1| 100|            2|            102|
// |     1| 200|            3|            201|
// |     1| 200|            4|            202|
// +------+----+-------------+---------------+
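
If you'd rather have plain column names than data[reading] and data[timeStamp], you can alias the extracted fields; a small variation on the snippet above:

// same explode, but with friendly column names
val renamed = df.withColumn("data", explode($"data"))
  .select($"userId", $"date",
    $"data".getItem("reading").alias("reading"),
    $"data".getItem("timeStamp").alias("timeStamp"))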
answered by zero323