I have the following DataFrame:
January | February | March
-----------------------------
10 | 10 | 10
20 | 20 | 20
50 | 50 | 50
I'm trying to add a column to this which is the sum of the values of each row.
January | February | March | TOTAL
----------------------------------
10 | 10 | 10 | 30
20 | 20 | 20 | 60
50 | 50 | 50 | 150
As far as I can see, all the built in aggregate functions seem to be for calculating values in single columns. How do I go about using values across columns on a per row basis (using Scala)?
I've gotten as far as
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
You were very close with this:
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
Instead, try this:
val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
I think this is the best of the the answers, because it is as fast as the answer with the hard-coded SQL query, and as convenient as the one that uses the UDF
. It's the best of both worlds -- and I didn't even add a full line of code!
Alternatively and using Hugo's approach and example, you can create a UDF
that receives any quantity of columns and sum
them all.
from functools import reduce
def superSum(*cols):
return reduce(lambda a, b: a + b, cols)
add = udf(superSum)
df.withColumn('total', add(*[df[x] for x in df.columns])).show()
+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With