How to transform Spark Dataframe columns to a single column of a string array

How can I "merge" multiple DataFrame columns into a single column that holds an array of strings?

For example, I have this dataframe:

val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")

Which looks like this:

scala> df.show
| Id|Name|Number|Comment|
|  1|Jack|   125|   Text|
|  2|Mary|   152|  Text2|

scala> df.printSchema
 |-- Id: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Comment: string (nullable = true)

How can I transform it so that it looks like this?

scala> df.show
| Id|             List|
|  1|  [Jack,125,Text]|
|  2| [Mary,152,Text2]|

scala> df.printSchema
 |-- Id: integer (nullable = false)
 |-- List: Array (nullable = true)
 |    |-- element: string (containsNull = true)
V. Samma asked Dec 07 '16 15:12

1 Answer

Use org.apache.spark.sql.functions.array, which combines several columns into one array column:

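import org.apache.spark.sql.functions._
// The $"..." column syntax needs the implicits import (sqlContext.implicits._ here);
// spark-shell brings it into scope automatically.
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")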

// +---+------------------+
// |Id |List              |
// +---+------------------+
// |1  |[Jack, 125, Text] |
// |2  |[Mary, 152, Text2]|
// +---+------------------+
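
If the set of columns is not fixed, the same array call can be driven by the column names instead of listing them by hand. The following is a minimal sketch and not part of the original answer: it assumes Spark 2.x or later (SparkSession rather than sqlContext) running in local mode, and the object name MergeColumns, the helper mergeToArray, and its keyCol parameter are hypothetical names chosen for illustration. Each column is cast to string so that all array elements share one type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col}

object MergeColumns {

  // Hypothetical helper: keeps keyCol as-is and packs every other column
  // into a single array<string> column named "List".
  def mergeToArray(df: DataFrame, keyCol: String): DataFrame = {
    // Cast each remaining column to string so the array has one element type.
    val listCols = df.columns.filter(_ != keyCol).map(c => col(c).cast("string"))
    df.select(col(keyCol), array(listCols: _*).as("List"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ in local mode; in spark-shell a session already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-columns")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq((1, "Jack", "125", "Text"), (2, "Mary", "152", "Text2"))
      .toDF("Id", "Name", "Number", "Comment")

    val result = mergeToArray(df, "Id")
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

The explicit cast is a design choice: it keeps the element type string even when some of the merged columns are numeric, so the result matches the schema asked for in the question.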
Tzach Zohar answered Sep 20 '22 06:09