
What is the correct way to sum different dataframe columns in a list in PySpark?

I want to sum different columns in a Spark dataframe.

Code

from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Why aren't approaches #2 and #3 working? I am on Spark 2.2.

asked Dec 07 '17 by GeorgeOfTheRF


2 Answers

Because,

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using Python's built-in sum function, which takes an iterable as input, so it works: https://docs.python.org/2/library/functions.html#sum
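
For illustration, here is an equivalent fold written with functools.reduce (a sketch reusing the question's dataframe and column names), which makes explicit what the built-in sum does with the Column objects:

from functools import reduce
from operator import add

# Fold the Column objects with `+`, which is what builtins.sum does here
cols_to_sum = [df[c] for c in ["`A.p1`", "`B.p1`"]]
df = df.withColumn('sum1', reduce(add, cols_to_sum))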

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using the PySpark sum function, which is an aggregate that takes a single column as input, whereas you are trying to apply it at row level across columns: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
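
To see the intended use of F.sum, here is a minimal aggregate example (a sketch using the question's dataframe): it totals one column over all rows, not across columns within a row.

# F.sum aggregates a single column over rows, e.g. the total of A.p1
df.agg(F.sum(df["`A.p1`"]).alias("total_A_p1")).show()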

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Here, df.select() returns a DataFrame, so you are trying to sum over a DataFrame object rather than over columns. In this case, I think, you would have to iterate row-wise and apply the sum over each row.
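
If you prefer not to rely on Python's built-in sum, a SQL expression gives the same row-wise addition (a sketch; the backticks escape the dotted column names from the question):

df = df.withColumn('sum1', F.expr("`A.p1` + `B.p1`"))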

answered Oct 22 '22 by Suresh

TL;DR builtins.sum is just fine.


Following your comments:

Using native Python sum() is not benefitting from Spark optimization, so what is the Spark way of doing it?

and

it's not a PySpark function, so it won't really be completely benefiting from Spark, right?

I can see you are making incorrect assumptions.

Let's decompose the problem:

[df[col] for col in ["`A.p1`","`B.p1`"]]

creates a list of Columns:

[Column<b'A.p1'>, Column<b'B.p1'>]

Let's call it iterable.

sum reduces this list by taking its elements and calling the __add__ method (+). The imperative equivalent is:

accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element

This gives a Column:

Column<b'(A.p1 + B.p1)'>

which is the same as calling

df["`A.p1`"] + df["`B.p1`"]

No data has been touched, and when evaluated it benefits from all Spark optimizations.
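
As a quick sanity check (a sketch based on the question's dataframe), explain() shows that the projection contains a single (A.p1 + B.p1) expression, so Catalyst treats it like any other column arithmetic:

# The plan contains one arithmetic projection, no driver-side computation
df.withColumn('sum1', sum([df[c] for c in ["`A.p1`", "`B.p1`"]])).explain()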

answered Oct 22 '22 by Alper t. Turker