Adding a column of rowsums across a list of columns in Spark Dataframe

I have a Spark dataframe with several columns. I want to add a column to the dataframe that is the sum of a subset of the columns.

For example, my data looks like this:

ID var1 var2 var3 var4 var5
a   5     7    9    12   13
b   6     4    3    20   17
c   4     9    4    6    9
d   1     2    6    8    1

I want to add a column that sums the values of specific columns for each row:

ID var1 var2 var3 var4 var5   sums
a   5     7    9    12   13    46
b   6     4    3    20   17    50
c   4     9    4    6    9     32
d   1     2    6    8    1     18

I know it is possible to add columns together if you know the specific columns to add:

val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))

But is it possible to pass a list of column names and add them together? This answer, which is basically what I want but uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe), suggests something like this would work:

//Select columns to sum
val columnstosum = List("var1", "var2", "var3", "var4", "var5")

// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columnstosum.head, columnstosum.tail: _*).sum)

This throws the error "value sum is not a member of org.apache.spark.sql.DataFrame". Is there a way to sum across a list of columns?

Thanks in advance for your help.

asked Jun 03 '16 by Sarah


2 Answers

You should try the following:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

// Needed for the toDF() conversion below
import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6 , 9),
  ("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")

// The columns to sum, as Column expressions
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))

// reduce(_ + _) folds the list into a single Column expression: var1 + var2 + ... + var5
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))

output.show()

Then the result is:

+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
|  a|   5|   7|   9|  12|  13|  46|
|  b|   6|   4|   3|  20|  17|  50|
|  c|   4|   9|   4|   6|   9|  32|
|  d|   1|   2|   6|   8|   1|  18|
+---+----+----+----+----+----+----+
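
If the columns to sum aren't known ahead of time, the list can also be built from the DataFrame's schema. A minimal sketch, assuming (purely for illustration) that every column to be summed starts with the var prefix:

// Assumption for illustration: the columns to sum share the "var" prefix
val columnsToSum = input.columns.filter(_.startsWith("var")).map(col).toList
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))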
answered Oct 22 '22 by Paweł Jurczenko


Plain and simple:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}

// Fold the given columns into a single sum expression, starting from a literal 0
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)

val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col)
df.select(sum_(columnstosum: _*))

with the Python equivalent:

from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col

def sum_(*cols):
    # Fold the columns into a single sum expression, starting from a literal 0
    return reduce(add, cols, lit(0))

columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
df.select("*", sum_(*columnstosum))

Both will produce null for the sum if any value in the row is null. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
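
For example, a minimal Scala sketch of the coalesce approach, treating nulls as zero (columnstosum is the list defined above):

import org.apache.spark.sql.functions.coalesce

// Replace null with 0 in each column before summing, so a single null value
// doesn't turn the whole row sum into null
val safeSum = columnstosum.map(c => coalesce(c, lit(0))).reduce(_ + _)
val withSums = df.withColumn("sums", safeSum)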

answered Oct 22 '22 by zero323