
How to append an element to an array column of a Spark Dataframe?

Suppose I have the following DataFrame:

scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]

scala> df1.show()
+---+----+
| id|nums|
+---+----+
|  a| [1]|
|  b| [1]|
+---+----+

And I want to add elements to the array in the nums column, so that I get something like the following:

+---+-------+
| id|nums   |
+---+-------+
|  a| [1,5] |
|  b| [1,5] |
+---+-------+

Is there a way to do this using the .withColumn() method of the DataFrame? E.g.

val df2 = df1.withColumn("nums", append(col("nums"), lit(5))) 

I've looked through the API documentation for Spark, but can't find anything that would allow me to do this. I could probably use split and concat_ws to hack something together, but I would prefer a more elegant solution if one is possible. Thanks.
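For reference, the split / concat_ws hack I had in mind would look roughly like this (one reason it's ugly: round-tripping through a string turns the array&lt;int&gt; into an array&lt;string&gt;):

import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, split}

// join the array into a comma-separated string, append ",5", then split back
// caveat: the resulting column is array<string>, not array<int>
val hacked = df1.withColumn("nums",
  split(concat(concat_ws(",", col("nums")), lit(",5")), ","))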

asked Apr 06 '18 by Shafique Jamal

People also ask

How do I update a column in spark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; instead, withColumn() (or any other approach) returns a new DataFrame with the updated values.
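The same pattern holds for the Scala API; a minimal sketch using the df1 above:

import org.apache.spark.sql.functions.{col, upper}

// "updating" a column just returns a new DataFrame; df1 itself is unchanged
val updated = df1.withColumn("id", upper(col("id")))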

How do I append a column to a DataFrame in PySpark?

In PySpark, to add a constant column to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes the constant value you want to add and returns a Column type; to add a NULL / None column, use lit(None).
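The Scala API uses lit() the same way; a minimal sketch:

import org.apache.spark.sql.functions.lit

val withConst = df1.withColumn("five", lit(5))                   // constant column
val withNull  = df1.withColumn("note", lit(null).cast("string")) // NULL column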


1 Answer

import org.apache.spark.sql.functions.{lit, array, array_union}

val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show

+---+------+
| id|  nums|
+---+------+
|  a|[1, 5]|
|  b|[1, 5]|
+---+------+

array_union() was added in the Spark 2.4.0 release on 11/2/2018, seven months after you asked the question :) See https://spark.apache.org/news/index.html
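One caveat: array_union() returns the union without duplicates, so appending a value that is already in the array is a no-op. If you need to keep duplicates, concat() (which also supports array columns since 2.4.0) preserves them; a minimal sketch:

import org.apache.spark.sql.functions.{array, concat, lit}

// concat keeps duplicates: appending 1 to [1] gives [1, 1], not [1]
val df3 = df1.withColumn("nums", concat($"nums", array(lit(5))))

(Spark 3.4.0 later added array_append(), which does exactly what the question asks for.)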

answered Sep 27 '22 by Dorren Chen