
How to merge two columns of a `Dataframe` in Spark into one 2-Tuple?

I have a Spark DataFrame df with five columns. I want to add another column whose values are tuples of the first and second columns. When I use the withColumn() method, I get a type mismatch error, because the argument is not a Column but a (Column, Column). Is there a solution other than running a for loop over the rows?

var dfCol=(col1:Column,col2:Column)=>(col1,col2)
val vv = df.withColumn( "NewColumn", dfCol( df(df.schema.fieldNames(1)) , df(df.schema.fieldNames(2)) ) )
asked Sep 26 '15 16:09 by TNM



2 Answers

You can use the struct function, which combines the provided columns into a single struct column (the DataFrame equivalent of a tuple):

import org.apache.spark.sql.functions.struct

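// toDF needs the session implicits in scope (spark-shell imports them automatically)
val df = Seq((1,2), (3,4), (5,3)).toDF("a", "b")
// struct returns a single Column, which is what withColumn expects
df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)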

+---+---+---------+
|a  |b  |NewColumn|
+---+---+---------+
|1  |2  |[1,2]    |
|3  |4  |[3,4]    |
|5  |3  |[5,3]    |
+---+---+---------+
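
To apply this to the question's case, where the columns are picked by position rather than by name, a minimal sketch might look like the following. The five-column frame, its column names (c1 to c5), and the local SparkSession are assumptions made up for illustration; in spark-shell the session lines can be skipped. Note that df.columns is zero-indexed, so the first and second columns sit at indices 0 and 1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("struct-sketch").getOrCreate()
import spark.implicits._

// Hypothetical five-column frame standing in for the asker's df.
val df = Seq((1, 2, "x", 1.0, true), (3, 4, "y", 2.0, false))
  .toDF("c1", "c2", "c3", "c4", "c5")

// Column positions are zero-based: index 0 is the first column, index 1 the second.
val first = df.columns(0)
val second = df.columns(1)

// struct(...) yields one Column, so it can be passed straight to withColumn.
val withPair = df.withColumn("NewColumn", struct(col(first), col(second)))

// The struct fields keep the source column names and can be read back with dot syntax.
val unpacked = withPair.select(col("NewColumn.c1"), col("NewColumn.c2"))
```

Because the struct keeps the field names, the pair can later be split apart again without knowing the original positions.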
answered Sep 22 '22 13:09 by Tautvydas


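You can also merge multiple DataFrame columns into one using array. Unlike struct, array is documented as requiring all input columns to have the same data type, and the result is an array column rather than a tuple-like struct.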

// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol")) 
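
As a rough, self-contained sketch of the array approach: the sample data and the local SparkSession below are assumptions for illustration, and both columns are deliberately Int so they share one type. Elements of the resulting column are then addressed by position with getItem.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("array-sketch").getOrCreate()
import spark.implicits._

// Hypothetical frame; both columns are Int, so array() can combine them.
val df = Seq((1, 2), (3, 4), (5, 3)).toDF("col1", "col2")

// Keep every existing column and append the combined array column.
val withArr = df.select($"*", array($"col1", $"col2").as("newCol"))

// Array elements are addressed by position rather than by field name.
val firstElems = withArr.select(col("newCol").getItem(0).as("first"))
```

If the two columns have different types, struct is probably the better fit, since each struct field keeps its own type.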
answered Sep 24 '22 13:09 by Abu Shoeb