Error : Union can only be performed on tables with the compatible column types. struct(tier:string,skyward_number:string,skyward_points:string) <> struct(skyward_number:string,tier:string,skyward_points:string) at the first column of the second table;;
Here the order of the struct fields is different, but everything else is the same.
dataframe1 Schema
root
|-- emcg_uuid: string (nullable = true)
|-- name: string (nullable = true)
|-- phone_no: string (nullable = true)
|-- dob: string (nullable = true)
|-- country: string (nullable = true)
|-- travel_type: string (nullable = true)
|-- gdpr_restricted_flg: string (nullable = false)
|-- gdpr_reason_code: string (nullable = false)
|-- document: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- skyward: struct (nullable = false)
| |-- tier: string (nullable = false)
| |-- skyward_number: string (nullable = false)
| |-- skyward_points: string (nullable = false)
dataframe2 schema
root
|-- emcg_uuid: string (nullable = true)
|-- name: string (nullable = true)
|-- phone_no: string (nullable = true)
|-- dob: string (nullable = true)
|-- country: string (nullable = true)
|-- travel_type: string (nullable = true)
|-- gdpr_restricted_flg: string (nullable = true)
|-- gdpr_reason_code: string (nullable = true)
|-- document: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- skyward: struct (nullable = false)
| |-- skyward_number: string (nullable = false)
| |-- tier: string (nullable = false)
| |-- skyward_points: string (nullable = false)
How can I solve this?
union is a Spark transformation that combines two DataFrames. It takes another DataFrame as input and returns a new DataFrame containing the rows of both dataframe1 and dataframe2, without removing duplicates.
Using union and unionAll, you can merge the data of two DataFrames into a new DataFrame. Remember that you can merge two Spark DataFrames this way only when they have the same schema. unionAll has been deprecated since Spark 2.0 in favor of union, which behaves the same way.
To merge two DataFrames whose columns are in a different order, use unionByName(), which resolves columns by name (not by position). In other words, unionByName() merges two DataFrames by column name instead of by position.
For example, suppose the first DataFrame (dataframe1) has the columns ['ID', 'NAME', 'Address'] and the second (dataframe2) has ['ID', 'Age']. Before the union, the Age column has to be added to the first DataFrame and the NAME and Address columns to the second. This can be done with the lit() function, which in PySpark is available in pyspark.sql.functions.
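The lit() approach described above can be sketched in Scala (the language of the other snippets on this page) as follows. This is only an illustration: the object name AlignColumnsSketch, the helper withMissingColumns, the sample rows, and the choice to fill missing columns with null strings are all assumptions, not something the original text specifies.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object AlignColumnsSketch {
  // Hypothetical helper: adds every column in `wanted` that `df` lacks,
  // filled with null and cast to string (an assumption about the types).
  def withMissingColumns(df: DataFrame, wanted: Seq[String]): DataFrame =
    wanted.filterNot(c => df.columns.contains(c)).foldLeft(df) { (acc, name) =>
      acc.withColumn(name, lit(null).cast("string"))
    }

  def merge(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Made-up sample rows with the column names from the text above.
    val dataframe1 = Seq(("1", "Ann", "Dubai")).toDF("ID", "NAME", "Address")
    val dataframe2 = Seq(("2", "31")).toDF("ID", "Age")

    val allColumns = (dataframe1.columns ++ dataframe2.columns).distinct.toSeq

    // Both sides now have the same column names, so unionByName can align them.
    withMissingColumns(dataframe1, allColumns)
      .unionByName(withMissingColumns(dataframe2, allColumns))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("align-columns").getOrCreate()
    merge(spark).show()
    spark.stop()
  }
}
```

If the missing columns are not strings in your data, the cast would need to match the real column type on the other side.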
The default Spark behaviour for union is standard SQL behaviour, so match-by-position. This means the schemas of both DataFrames must contain the same fields in the same order.
If you want to match schemas by name, use unionByName, introduced in Spark 2.3.
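To make the difference concrete, here is a small sketch with two top-level columns in swapped order. The object name, column names, and rows are invented for illustration. Note that this only demonstrates top-level columns; I would not rely on unionByName alone to reorder fields inside a struct such as skyward.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameSketch {
  // Returns (positional union, by-name union) for two frames whose
  // top-level columns are the same but in a different order.
  def build(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    val left = Seq(("Ann", "AE")).toDF("name", "country")
    val right = Seq(("IN", "Raj")).toDF("country", "name")

    // Positional: the row from `right` ends up with name = "IN", country = "Raj".
    val byPosition = left.union(right)

    // By name: the row from `right` ends up with name = "Raj", country = "IN".
    val byName = left.unionByName(right)

    (byPosition, byName)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-by-name").getOrCreate()
    val (byPosition, byName) = build(spark)
    byPosition.show()
    byName.show()
    spark.stop()
  }
}
```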
You can also re-map fields:
val df1 = ...
val df2 = ...
// Rename df1's columns, position by position, to df2's column names, then union by position
df1.toDF(df2.columns: _*).union(df2)
Edit: I saw the edit now.
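You can rebuild the struct column with its fields in the same order as in df2:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." syntax; assumes a SparkSession named spark
// Assumes the three fields exist only inside the skyward struct, so they are read via skyward.<field>
val withCorrectedStruct = df1.withColumn("skyward", struct($"skyward.skyward_number", $"skyward.tier", $"skyward.skyward_points"))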
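As a fuller sketch of the same idea, the field order can be read from df2's schema instead of being typed out by hand. Everything apart from the skyward field names is an assumption made for illustration: the object name, the sample rows, and the reduction of the schema to emcg_uuid plus skyward.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

object ReorderStructSketch {
  // Rebuilds df's struct column `name` so its fields follow the order
  // used by the same struct column in `target`.
  def alignStruct(df: DataFrame, target: DataFrame, name: String): DataFrame = {
    val targetFields = target.schema(name).dataType.asInstanceOf[StructType].fieldNames
    df.withColumn(name, struct(targetFields.map(f => col(s"$name.$f").as(f)): _*))
  }

  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Made-up rows; only the struct layout mirrors the question.
    val df1 = Seq(("u1", "gold", "123", "900"))
      .toDF("emcg_uuid", "tier", "skyward_number", "skyward_points")
      .select($"emcg_uuid", struct($"tier", $"skyward_number", $"skyward_points").as("skyward"))

    val df2 = Seq(("u2", "456", "blue", "100"))
      .toDF("emcg_uuid", "skyward_number", "tier", "skyward_points")
      .select($"emcg_uuid", struct($"skyward_number", $"tier", $"skyward_points").as("skyward"))

    // After alignment both skyward structs have identical field order.
    alignStruct(df1, df2, "skyward").union(df2)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("reorder-struct").getOrCreate()
    val result = build(spark)
    result.printSchema()
    result.show(truncate = false)
    spark.stop()
  }
}
```

Reading the order from the target schema avoids repeating the field list, at the cost of failing at analysis time if df1's struct lacks one of df2's fields.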