I want to sum different columns in a Spark DataFrame.
Code
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Why aren't approaches #2 and #3 working? I am on Spark 2.2.
Note that sum() in PySpark (F.sum) is an aggregate function: it returns the total of the values in a single column of the DataFrame, not a row-wise sum across columns.
The addition of multiple columns can instead be achieved using the expr function in PySpark, which takes a SQL expression string to be computed as input:
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
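A rough sketch of the same idea applied to this question's columns (assuming the dotted names need to be quoted with backticks inside the SQL expression string):
from pyspark.sql.functions import expr
# Backticks quote the dotted column names inside the expression string
expression = ' + '.join(['`A.p1`', '`B.p1`'])  # yields "`A.p1` + `B.p1`"
df = df.withColumn('sum_cols', expr(expression))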
Because,
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the Python built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the PySpark sum function (F.sum), which is an aggregate that takes a column as input, but you are trying to apply it at row level to a Python list of columns. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
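For contrast, a minimal sketch of what F.sum is actually meant for, aggregating a single column over all rows (using the sample data from the question, where A.p1 holds 1, 4 and 12):
# F.sum aggregates one column across rows, so this yields a single row containing 17
df.select(F.sum(df['`A.p1`'])).show()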
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think, you would have to iterate row-wise and apply the sum over each row.
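A rough sketch of what that row-wise iteration could look like, by dropping to the RDD level; it is shown only for illustration, since it bypasses the DataFrame optimizer and the column-expression approaches are preferable:
# Hypothetical row-wise variant: sum the values of each Row in plain Python
sums = df.select(['`A.p1`', '`B.p1`']).rdd.map(lambda row: sum(row)).collect()
# -> [3, 93, 72] for the sample data in the question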
TL;DR builtins.sum is just fine.
Following your comments:
Using native Python sum() is not benefitting from Spark optimization, so what's the Spark way of doing it?
and
it's not a PySpark function so it won't really be completely benefiting from Spark, right?
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable. sum reduces the output by taking elements of this list and calling the __add__ method (+). The imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
This gives a Column:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched, and when evaluated it benefits from all Spark optimizations.
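As a sketch making the same reduction explicit, functools.reduce with operator.add builds the identical Column expression; it may read as more obviously "Spark-friendly", even though builtins.sum already does the same thing:
from functools import reduce
from operator import add
# Builds the same (A.p1 + B.p1) Column expression as builtins.sum above
sum_col = reduce(add, [df[col] for col in ['`A.p1`', '`B.p1`']])
df = df.withColumn('sum1', sum_col)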