I am using Apache Spark 2.0 Dataframe/Dataset API I want to add a new column to my dataframe from List of values. My list has same number of values like given dataframe.
val list = List(4,5,10,7,2)
val df = List("a","b","c","d","e").toDF("row1")
I would like to do something like:
val appendedDF = df.withColumn("row2",somefunc(list))
df.show()
// +----+------+
// |row1 |row2 |
// +----+------+
// |a |4 |
// |b |5 |
// |c |10 |
// |d |7 |
// |e |2 |
// +----+------+
For any ideas I would be greatful, my dataframe in reality contains more columns.
You could do it like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// create rdd from the list
val rdd = sc.parallelize(List(4,5,10,7,2))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28
// zip the data frame with rdd
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32
// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
| a| 4|
| b| 5|
| c| 10|
| d| 7|
| e| 2|
+----+-------+
Adding for completeness: the fact that the input list
(which exists in driver memory) has the same size as the DataFrame
suggests that this is a small DataFrame to begin with - so you might consider collect()
-ing it, zipping with list
, and converting back into a DataFrame
if needed:
df.collect()
.map(_.getAs[String]("row1"))
.zip(list).toList
.toDF("row1", "row2")
That won't be faster, but if the data is really small it might be negligible and the code is (arguably) clearer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With