Spark : Union can only be performed on tables with the compatible column types. Struct<name,id> != Struct<id,name>

Error : Union can only be performed on tables with the compatible column types. struct(tier:string,skyward_number:string,skyward_points:string) <> struct(skyward_number:string,tier:string,skyward_points:string) at the first column of the second table;;

The only difference is the order of the fields inside the struct; everything else is the same.

dataframe1 schema

root
 |-- emcg_uuid: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone_no: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- country: string (nullable = true)
 |-- travel_type: string (nullable = true)
 |-- gdpr_restricted_flg: string (nullable = false)
 |-- gdpr_reason_code: string (nullable = false)
 |-- document: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- skyward: struct (nullable = false)
 |    |-- tier: string (nullable = false)
 |    |-- skyward_number: string (nullable = false)
 |    |-- skyward_points: string (nullable = false)

dataframe2 schema

root
 |-- emcg_uuid: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone_no: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- country: string (nullable = true)
 |-- travel_type: string (nullable = true)
 |-- gdpr_restricted_flg: string (nullable = true)
 |-- gdpr_reason_code: string (nullable = true)
 |-- document: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- skyward: struct (nullable = false)
 |    |-- skyward_number: string (nullable = false)
 |    |-- tier: string (nullable = false)
 |    |-- skyward_points: string (nullable = false)

How can I solve this?


asked Sep 05 '18 13:09

Ravi

1 Answer

By default, Spark's union follows standard SQL behaviour and matches columns by position. This means the schemas of both DataFrames must contain the same fields in the same order.

If you want to match schema by name, use unionByName, introduced in Spark 2.3.
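
To illustrate the difference, here is a minimal sketch. It assumes a local SparkSession and two small made-up DataFrames (`left` and `right` are hypothetical names, not the DataFrames from the question). It only covers top-level columns; a struct whose nested fields are in a different order may still need to be rebuilt.

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-by-name-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: same column names, different top-level order
    val left  = Seq(("gold", "123")).toDF("tier", "skyward_number")
    val right = Seq(("456", "silver")).toDF("skyward_number", "tier")

    // Positional match: the row from `right` ends up with tier = "456"
    left.union(right).show()

    // Match by name: the row from `right` ends up with tier = "silver"
    left.unionByName(right).show()

    spark.stop()
  }
}
```

Because both columns are strings, the positional union does not fail here; it silently puts values under the wrong column, which is why matching by name is usually the safer choice when the column order is not guaranteed.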

You can also re-map fields:

val df1 = ...
val df2 = ...
// toDF renames df1's columns, by position, to df2's column names; it does not reorder the data
df1.toDF(df2.columns: _*).union(df2)
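
If the two DataFrames share the same top-level column names but in a different order, a possible alternative is to select the columns of one DataFrame in the order the other declares them before calling union. The sketch below assumes a local SparkSession and made-up data; the column values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReorderColumnsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("reorder-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data, not the DataFrames from the question
    val df1 = Seq(("Alice", "IN")).toDF("name", "country")
    val df2 = Seq(("AE", "Bob")).toDF("country", "name")

    // Select df1's columns in the order df2 declares them,
    // so the positional union lines up matching columns
    val aligned = df1.select(df2.columns.map(col): _*)
    aligned.union(df2).show()

    spark.stop()
  }
}
```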

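Edit: I have just seen the edit to the question.

You can rebuild the struct with its fields in the order the other DataFrame uses:

import org.apache.spark.sql.functions._
// Read the nested fields of df1's struct and reassemble them in df2's field order
val withCorrectedStruct = df1.withColumn("skyward",
  struct(col("skyward.skyward_number"), col("skyward.tier"), col("skyward.skyward_points")))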
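
A fuller sketch of this approach, ending in the union, might look like the following. It assumes a local SparkSession and two tiny made-up DataFrames that keep only `emcg_uuid` and the `skyward` struct from the question's schemas; the row values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

object StructReorderSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("struct-reorder-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows; only the struct field order mirrors the question
    val df1 = Seq(("u1", "gold", "123", "900"))
      .toDF("emcg_uuid", "tier", "skyward_number", "skyward_points")
      .select(col("emcg_uuid"),
        struct(col("tier"), col("skyward_number"), col("skyward_points")).as("skyward"))
    val df2 = Seq(("u2", "456", "silver", "100"))
      .toDF("emcg_uuid", "skyward_number", "tier", "skyward_points")
      .select(col("emcg_uuid"),
        struct(col("skyward_number"), col("tier"), col("skyward_points")).as("skyward"))

    // Rebuild df1's struct so its fields follow df2's order
    val corrected = df1.withColumn("skyward",
      struct(col("skyward.skyward_number"), col("skyward.tier"), col("skyward.skyward_points")))

    // Both structs now have the same field order, so the positional union lines up
    val result = corrected.union(df2)
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

Selecting a nested field with `col("skyward.tier")` yields a column named `tier`, so the rebuilt struct keeps the original field names without extra aliases.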


answered Oct 13 '22 00:10

T. Gawęda


Tags: union, struct, apache-spark, apache-spark-sql
