How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df
, I get an error.
# Create SparkDataFrame
df <- createDataFrame(faithful)
# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works
# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work
Either SparkR or PySpark code will work.
In order to calculate sum of two or more columns in pyspark. we will be using + operator of the column to calculate sum of columns. Second method is to calculate sum of columns in pyspark and add it to the dataframe by using simple + operation along with select Function.
You can add multiple columns to Spark DataFrame in several ways if you wanted to add a known set of columns you can easily do by chaining withColumn() or on select(). However, sometimes you may need to add multiple columns after applying some transformations n that case you can use either map() or foldLeft().
Method -1 : Using select() method If we want to return the total value from multiple columns, we must use the sum() method inside the select() method by specifying the column name separated by a comma. Where, df is the input PySpark DataFrame. column_name is the column to get the sum value.
Using concat() Function to Concatenate DataFrame Columns Spark SQL functions provide concat() to concatenate two or more DataFrame columns into a single Column. It can also take columns of different Data Types and concatenate them into a single column. for example, it supports String, Int, Boolean and also arrays.
For PySpark, if you don't want to explicitly type out the columns:
from operator import add
from functools import reduce
new_df = df.withColumn('total',reduce(add, [F.col(x) for x in numeric_col_list]))
org.apache.spark.sql.functions.sum(Column e)
Aggregate function: returns the sum of all values in the expression.
As you can see, sum
takes just one column as input so sum(df$waiting, df$eruptions)
wont work.Since you wan to sum up the numeric fields, you can dosum(df("waiting") + df("eruptions"))
.If you wan to sum up values for individual columns then, you can df.agg(sum(df$waiting),sum(df$eruptions)).show
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With