
Spark union fails with nested JSON dataframe

I have the following two JSON files:

{
    "name" : "Agent1",
    "age" : "32",
    "details" : [{
            "d1" : 1,
            "d2" : 2
        }
    ]
}

{
    "name" : "Agent2",
    "age" : "42",
    "details" : []
}

I read them with Spark:

val jsonDf1 = spark.read.json(pathToJson1)
val jsonDf2 = spark.read.json(pathToJson2)

Two DataFrames are created with the following schemas:

root
 |-- age: string (nullable = true)
 |-- details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- d1: long (nullable = true)
 |    |    |-- d2: long (nullable = true)
 |-- name: string (nullable = true)

root
 |-- age: string (nullable = true)
 |-- details: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- name: string (nullable = true)

When I try to perform a union of these two DataFrames, I get this error:

jsonDf1.union(jsonDf2)


org.apache.spark.sql.AnalysisException: unresolved operator 'Union;;
'Union
:- LogicalRDD [age#0, details#1, name#2]
+- LogicalRDD [age#7, details#8, name#9]

How can I resolve this? The JSON files the Spark job loads will sometimes contain empty arrays, but it still has to union the resulting DataFrames, which shouldn't be a problem since the schema of the JSON files is the same.

asked Mar 01 '17 by morm

2 Answers

If you try to union the two DataFrames directly, you will get this error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(StringType,true) <> ArrayType(StructType(StructField(d1,StringType,true), StructField(d2,StringType,true)),true) at the second column of the second table
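
As a quick sanity check (a minimal sketch, not part of the original answer), you can compare the two inferred schemas before attempting the union:

// Sketch: confirm the mismatch between the inferred schemas
jsonDf1.printSchema()
jsonDf2.printSchema()

if (jsonDf1.schema != jsonDf2.schema) {
  println("Schemas differ, so a plain union will fail")
}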

JSON files arrive at the same time

To solve this problem, if you can read the JSON files at the same time, I would suggest:

val jsonDf1 = spark.read.json("json1.json", "json2.json")

This will give this schema:

jsonDf1.printSchema
root
 |-- age: string (nullable = true)
 |-- details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- d1: long (nullable = true)
 |    |    |-- d2: long (nullable = true)
 |-- name: string (nullable = true)

The data output:

jsonDf1.show(10,truncate = false)
+---+-------+------+
|age|details|name  |
+---+-------+------+
|32 |[[1,2]]|Agent1|
|42 |null   |Agent2|
+---+-------+------+

JSON files arrive at different times

If your JSON files arrive at different times, as a default solution I would recommend reading a template JSON object containing a full array alongside the real file; that keeps the DataFrame valid for any union even when the real file has an empty array. Then, filter out this fake JSON record before outputting the result:

val df = spark.read.json("jsonWithMaybeAnEmptyArray.json", 
"TemplateFakeJsonWithAFullArray.json")

import spark.implicits._ // needed for the $"column" syntax

df.filter($"name" =!= "FakeAgent").show(1)
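
For illustration only, the template file could look something like this (the "FakeAgent" name comes from the filter above; the age and d1/d2 values are placeholders):

{
    "name" : "FakeAgent",
    "age" : "0",
    "details" : [{
            "d1" : 0,
            "d2" : 0
        }
    ]
}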

Please note: a JIRA ticket has been opened to improve the capability to merge SQL data types (https://issues.apache.org/jira/browse/SPARK-19536), and this kind of operation should become possible in a future Spark version.
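
As an aside that is not part of the original answer: if the expected schema is known up front, another workaround is to pass an explicit schema to the reader, so an empty details array is never inferred as array<string> and the union succeeds. A minimal sketch, assuming the schema from the question:

import org.apache.spark.sql.types._

// Declare the expected schema explicitly so an empty "details" array keeps
// the struct element type instead of being inferred as array<string>
val agentSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true),
  StructField("details", ArrayType(StructType(Seq(
    StructField("d1", LongType, nullable = true),
    StructField("d2", LongType, nullable = true)
  ))), nullable = true)
))

val df1 = spark.read.schema(agentSchema).json("json1.json")
val df2 = spark.read.schema(agentSchema).json("json2.json")
df1.union(df2).show(false)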

answered Oct 03 '22 by Paul Leclercq


polomarcus's answer led me to this solution. I couldn't read all the files in a single call because I get a list of files as input, and Spark's json() reader takes varargs rather than a List of paths, but with Scala's varargs expansion it's possible to do this:

val files = List("path1", "path2", "path3")
val dataframe = spark.read.json(files: _*)

This way I got one dataframe containing all three files.
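
A small aside (assuming a directory layout not stated in the question): if the files all sit under one directory, a glob path works as well, since the reader accepts Hadoop-style path patterns:

// Hypothetical path; reads every matching JSON file in the directory
val dataframe = spark.read.json("/path/to/agents/*.json")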

answered Oct 03 '22 by morm