Apache Spark: how to append a new column from a list/array to a Spark DataFrame

Question

I am using the Apache Spark 2.0 DataFrame/Dataset API. I want to add a new column to my DataFrame from a List of values. My list has the same number of values as the DataFrame has rows.

val list = List(4,5,10,7,2)
val df   = List("a","b","c","d","e").toDF("row1")

I would like to do something like:

val appendedDF = df.withColumn("row2",somefunc(list))
appendedDF.show()
// +----+----+
// |row1|row2|
// +----+----+
// |   a|   4|
// |   b|   5|
// |   c|  10|
// |   d|   7|
// |   e|   2|
// +----+----+

I would be grateful for any ideas. In reality, my DataFrame contains more columns.

Psidom · Accepted Answer

You could do it like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// create rdd from the list
val rdd = sc.parallelize(List(4,5,10,7,2))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28

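// zip the data frame with rdd
// (RDD.zip assumes both RDDs have the same number of partitions
// and the same number of elements in each partition)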
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32

// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
|   a|      4|
|   b|      5|
|   c|     10|
|   d|      7|
|   e|      2|
+----+-------+
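
Because `RDD.zip` is sensitive to how the two RDDs happen to be partitioned, a more defensive variant is to key both sides by position with `zipWithIndex` and join on that key. The following is only a sketch under assumptions, not part of the original answer: it builds its own local `SparkSession` (in `spark-shell` the existing `spark` can be used instead), it names the new column `row2` as in the question, and it relies on `df.rdd` returning the rows in the order in which they were created.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session for illustration; in spark-shell reuse `spark`.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("append-column-sketch")
  .getOrCreate()
import spark.implicits._

val list = List(4, 5, 10, 7, 2)
val df   = List("a", "b", "c", "d", "e").toDF("row1")

// Key every DataFrame row and every list value by its position.
val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
val indexedValues = spark.sparkContext.parallelize(list)
  .zipWithIndex()
  .map { case (value, idx) => (idx, value) }

// Join on the position, restore the original order, and append the value to each row.
val joinedRows = indexedRows.join(indexedValues)
  .sortByKey()
  .values
  .map { case (row, value) => Row.fromSeq(row.toSeq :+ value) }

// Extend the existing schema, so any other columns of df are carried along unchanged.
val appendedDF = spark.createDataFrame(joinedRows, df.schema.add("row2", IntegerType))
appendedDF.show()
```

The join introduces a shuffle, so this is slower than a plain `zip`; the trade-off is that it does not depend on both RDDs having identical partition layouts.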

Tzach Zohar · Answer

Adding for completeness: the fact that the input list (which lives in driver memory) has the same size as the DataFrame suggests that this is a small DataFrame to begin with, so you might consider collect()-ing it, zipping it with the list, and converting the result back into a DataFrame if needed:

df.collect()
  .map(_.getAs[String]("row1"))
  .zip(list).toList
  .toDF("row1", "row2")

That won't be faster, but if the data is really small the difference may be negligible, and the code is (arguably) clearer.
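
Since the question mentions that the real DataFrame has more columns, the collect-and-zip idea can be extended to keep whole rows instead of extracting a single field. This is a sketch under assumptions rather than the answerer's code: the extra column `other` and its values are hypothetical, the session is created locally for illustration, and the `require` guard is an added safety check.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session for illustration; in spark-shell reuse `spark`.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("collect-zip-sketch")
  .getOrCreate()
import spark.implicits._

val list = List(4, 5, 10, 7, 2)
// Hypothetical DataFrame with one extra column besides row1.
val df = List(("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0), ("e", 5.0))
  .toDF("row1", "other")

// Bring all rows to the driver; only sensible for small DataFrames.
val localRows = df.collect()
require(localRows.length == list.length,
  "the list and the DataFrame must have the same number of rows")

// Pair each row with its list value and append the value as a new last field.
val extendedRows = localRows.zip(list).map { case (row, value) =>
  Row.fromSeq(row.toSeq :+ value)
}

// Rebuild a DataFrame using the original schema plus the new column.
val appendedDF = spark.createDataFrame(
  spark.sparkContext.parallelize(extendedRows.toSeq),
  df.schema.add("row2", IntegerType)
)
appendedDF.show()
```

Reusing `df.schema` means the existing column names and types do not have to be spelled out again, which is the main difference from the single-column `toDF("row1", "row2")` version.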

Tags: dataframe, scala, apache-spark, apache-spark-sql

Stefan Repcek
