
Spark - Sum of row values

I have the following DataFrame:

January | February | March
-----------------------------
  10    |    10    |  10
  20    |    20    |  20
  50    |    50    |  50

I'm trying to add a column to this which is the sum of the values of each row.

January | February | March  | TOTAL
----------------------------------
  10    |    10    |   10   |  30
  20    |    20    |   20   |  60
  50    |    50    |   50   |  150

As far as I can see, all the built-in aggregate functions seem to operate on single columns. How do I go about using values across columns on a per-row basis (using Scala)?

I've gotten as far as

val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
asked Apr 01 '16 by karoma

2 Answers

You were very close with this:

val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

Instead, try this:

val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")

I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query, and as convenient as the one that uses the UDF. It's the best of both worlds -- and I didn't even add a full line of code!
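The reduce here is an ordinary left fold over the column expressions. The same idea can be sketched with plain Python values (hypothetical rows mirroring the question's data, no Spark required):

```python
from functools import reduce

# Stand-in rows; in Spark the fold runs over Column expressions
# rather than concrete values, but the shape is identical.
rows = [
    {"January": 10, "February": 10, "March": 10},
    {"January": 20, "February": 20, "March": 20},
    {"January": 50, "February": 50, "March": 50},
]
cols_to_sum = ["January", "February", "March"]

# Fold `+` across the selected columns of each row.
totals = [reduce(lambda a, b: a + b, (r[c] for c in cols_to_sum)) for r in rows]
print(totals)  # [30, 60, 150]
```

In Spark, because `Column` objects also support `+`, the fold builds one column expression that is evaluated row by row by the engine, with no Python or Scala function called per row.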

answered Oct 28 '22 by David Griffin


Alternatively, using Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.

from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    # Fold `+` over however many column values are passed in.
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)

df.withColumn('total', add(*[df[x] for x in df.columns])).show()


+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+
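The superSum function itself is plain Python, so its fold behavior can be checked without a Spark session (the argument values below are just illustrative):

```python
from functools import reduce

def superSum(*cols):
    # Same function as in the answer: fold `+` over the arguments.
    return reduce(lambda a, b: a + b, cols)

print(superSum(10, 10, 10))     # 30
print(superSum(20, 20, 20, 5))  # 65
```

Note that wrapping this in a `udf` means each row is shipped to a Python worker for evaluation, so the pure column-expression fold from the first answer will generally be faster.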
answered Oct 28 '22 by Alberto Bonsanto