
Spark : Union can only be performed on tables with the compatible column types. Struct<name,id> != Struct<id,name>

Error : Union can only be performed on tables with the compatible column types. struct(tier:string,skyward_number:string,skyward_points:string) <> struct(skyward_number:string,tier:string,skyward_points:string) at the first column of the second table;;

Here the order of the struct fields is different, but everything else is the same.

dataframe1 schema

root
 |-- emcg_uuid: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone_no: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- country: string (nullable = true)
 |-- travel_type: string (nullable = true)
 |-- gdpr_restricted_flg: string (nullable = false)
 |-- gdpr_reason_code: string (nullable = false)
 |-- document: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- skyward: struct (nullable = false)
 |    |-- tier: string (nullable = false)
 |    |-- skyward_number: string (nullable = false)
 |    |-- skyward_points: string (nullable = false)

dataframe2 schema
root
 |-- emcg_uuid: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone_no: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- country: string (nullable = true)
 |-- travel_type: string (nullable = true)
 |-- gdpr_restricted_flg: string (nullable = true)
 |-- gdpr_reason_code: string (nullable = true)
 |-- document: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- skyward: struct (nullable = false)
 |    |-- skyward_number: string (nullable = false)
 |    |-- tier: string (nullable = false)
 |    |-- skyward_points: string (nullable = false)

How can I solve this?

Ravi asked Sep 05 '18 13:09



1 Answer

The default Spark behaviour for union is the standard SQL behaviour: columns are matched by position, not by name. This means the schemas of both DataFrames must contain the same fields, with the same types, in the same order.

If you want to match schema by name, use unionByName, introduced in Spark 2.3.
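
As a minimal sketch of the difference, assuming a local SparkSession and two toy DataFrames (the names `left` and `right` and their contents are made up for illustration and are not from the question):

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameSketch {
  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("union-sketch").getOrCreate()
    import spark.implicits._

    // Same top-level columns, declared in a different order (hypothetical data)
    val left  = Seq(("1", "Ann")).toDF("id", "name")
    val right = Seq(("Bob", "2")).toDF("name", "id")

    // union matches by position: right's "name" values end up under "id"
    left.union(right).show()

    // unionByName (Spark 2.3+) matches top-level columns by name
    left.unionByName(right).show()

    spark.stop()
  }
}
```

This only illustrates top-level columns. It is an assumption here that unionByName, as introduced in 2.3, does not reorder fields nested inside a struct, so the struct from the question would still need to be rebuilt separately.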

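You can also re-map the column names. This renames df1's columns to df2's names by position; it does not reorder them:

val df1 = ...
val df2 = ...
// Rename df1's columns to df2's names, position by position, then union
df1.toDF(df2.columns: _*).union(df2)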
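
To make the distinction between renaming and reordering concrete, here is a small sketch under the same kind of assumptions (a local SparkSession and made-up toy data). The `select`-based variant is an alternative not mentioned in the answer; it picks df1's columns by name in df2's order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RemapColumnsSketch {
  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("remap-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data: same columns, different order
    val df1 = Seq(("1", "Ann")).toDF("id", "name")
    val df2 = Seq(("Bob", "2")).toDF("name", "id")

    // toDF only relabels by position: df1's first column is now called "name",
    // but it still holds the old "id" values
    val renamed = df1.toDF(df2.columns: _*)

    // select picks columns by name in df2's order, so values stay with their names
    val reordered = df1.select(df2.columns.map(col): _*)

    renamed.union(df2).show()
    reordered.union(df2).show()

    spark.stop()
  }
}
```

Renaming with toDF fits when the two DataFrames hold the same data under different names; selecting by name fits when the names agree and only the order differs.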

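Edit: I have now seen the edit to the question.

You can rebuild the struct with its fields in the order the other DataFrame uses:

import org.apache.spark.sql.functions._
// Rebuild df1's struct, reading the nested fields in the order df2 uses
val withCorrectedStruct = df1.withColumn("skyward",
  struct(col("skyward.skyward_number"), col("skyward.tier"), col("skyward.skyward_points")))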
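
Rebuilding the struct can also be written without hard-coding the field names. The following is a sketch only: the helper `alignStruct` is hypothetical (not a Spark API), it assumes both structs contain exactly the same field names one level deep, and the DataFrames are cut-down stand-ins for the ones in the question with made-up values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

object StructReorderSketch {
  // Hypothetical helper: rebuild struct column `name` in `df` so that its
  // fields follow the order used by the same column in `target`.
  def alignStruct(df: DataFrame, target: DataFrame, name: String): DataFrame = {
    val fieldOrder = target.schema(name).dataType.asInstanceOf[StructType].fieldNames
    df.withColumn(name, struct(fieldOrder.map(f => col(s"$name.$f")): _*))
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("struct-reorder").getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the question's DataFrames (values are made up)
    val df1 = Seq(("u1", "gold", "123", "500"))
      .toDF("emcg_uuid", "tier", "skyward_number", "skyward_points")
      .select($"emcg_uuid", struct($"tier", $"skyward_number", $"skyward_points").as("skyward"))
    val df2 = Seq(("u2", "456", "silver", "90"))
      .toDF("emcg_uuid", "skyward_number", "tier", "skyward_points")
      .select($"emcg_uuid", struct($"skyward_number", $"tier", $"skyward_points").as("skyward"))

    // Align df1's struct to df2's field order before the positional union
    val merged = alignStruct(df1, df2, "skyward").union(df2)
    merged.printSchema()
    merged.show(truncate = false)

    spark.stop()
  }
}
```

Reading the field order from the target schema keeps the two DataFrames in step if fields are later added to both; deeper nesting would need a recursive variant.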

T. Gawęda answered Oct 13 '22 00:10