I currently have code in which I repeatedly apply the same procedure to multiple DataFrame columns via multiple chains of .withColumn, and I want to create a function to streamline the procedure. In my case, I am finding cumulative sums over columns aggregated by keys:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val newDF = oldDF
  .withColumn("cumA", sum("A").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumB", sum("B").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumC", sum("C").over(Window.partitionBy("ID").orderBy("time")))
  //.withColumn(...)
What I would like is either something like:
def createCumulativeColums(cols: Array[String], df: DataFrame): DataFrame = {
  // Implement the above cumulative sums, partitioning, and ordering
}
or better yet:
def withColumns(cols: Array[String], df: DataFrame, f: function): DataFrame = {
  // Implement a udf/arbitrary function on all the specified columns
}
As background: the withColumn() function takes two arguments, the name of the output column and its value as a Column. It returns a new DataFrame with a column of that name; if a column with the same name already exists, it is replaced. withColumn() is one of the most commonly used Spark SQL transformations: it can derive a column from other columns, change a column's values, convert an existing column's datatype, create a new column, and more.
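For example, a minimal sketch of both behaviors (df and the column names here are hypothetical):

import org.apache.spark.sql.functions.col

// Adds a new column "doubledA" derived from "A":
val withDoubled = df.withColumn("doubledA", col("A") * 2)

// Reuses the existing name "A": the old column is replaced,
// here by a copy cast to a different datatype:
val retyped = withDoubled.withColumn("A", col("A").cast("double"))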
The triple equals operator === is conventionally a type-safe equality operator in Scala libraries (somewhat analogous to JavaScript's strict equality). Spark defines it as a method on Column that creates a new Column comparing the Column on the left with the operand on the right, evaluating to a boolean for each row.
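For instance, a small sketch (df and the "status" column are illustrative):

import org.apache.spark.sql.functions.col

// Builds a boolean Column and uses it as a filter predicate:
val active = df.filter(col("status") === "active")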
You can use select with varargs, including *:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

df.select($"*" +: Seq("A", "B", "C").map(c =>
  sum(c).over(Window.partitionBy("ID").orderBy("time")).alias(s"cum$c")
): _*)
This:

- Seq("A", ...).map(...) maps the column names to their windowed cumulative-sum expressions,
- $"*" +: ... prepends all of the existing columns to that sequence,
- ... : _* unpacks the combined sequence into the varargs expected by select,
and can be generalized as:
import org.apache.spark.sql.{Column, DataFrame}

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 */
def withColumns(cols: Seq[String], df: DataFrame, f: String => Column) =
  df.select($"*" +: cols.map(c => f(c)): _*)
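With this helper, the question's cumulative sums could be invoked as follows (a sketch; w and newDF are illustrative names, and the alias is applied inside f because this select-based version uses f(c) directly as the output expression):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val w = Window.partitionBy("ID").orderBy("time")
val newDF = withColumns(Seq("A", "B", "C"), oldDF, c => sum(c).over(w).alias(s"cum$c"))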
If you find the withColumn syntax more readable, you can use foldLeft:
Seq("A", "B", "C").foldLeft(df)((df, c) => df.withColumn(s"cum$c", sum(c).over(Window.partitionBy("ID").orderBy("time"))) )
which can be generalized, for example, to:
/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 * @param name a function mapping from input to output name.
 */
def withColumns(cols: Seq[String], df: DataFrame,
                f: String => Column,
                name: String => String = identity) =
  cols.foldLeft(df)((df, c) => df.withColumn(name(c), f(c)))
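A sketch of calling this variant on the question's data (w and cumDF are illustrative names), with the name function supplying the cum prefix:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val w = Window.partitionBy("ID").orderBy("time")
val cumDF = withColumns(Seq("A", "B", "C"), oldDF, c => sum(c).over(w), c => s"cum$c")

One trade-off worth noting: each withColumn call adds a projection to the logical plan, so for a large number of columns the single select variant above tends to produce a simpler plan.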