
What is the correct way to sum different dataframe columns in a list in PySpark?

I want to sum different columns in a Spark dataframe.

Code

from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Why aren't approaches #2 and #3 working? I am on Spark 2.2.

asked Dec 07 '17 by GeorgeOfTheRF


2 Answers

Because,

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using Python's built-in sum function, which takes an iterable as input, so it works: https://docs.python.org/2/library/functions.html#sum
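
For illustration, here is an equivalent fold written with functools.reduce (a sketch reusing the question's dataframe and column names), which makes explicit what the built-in sum does with the Column objects:

from functools import reduce
from operator import add

# Fold the Column objects with `+`, which is what builtins.sum does here
cols_to_sum = [df[c] for c in ["`A.p1`", "`B.p1`"]]
df = df.withColumn('sum1', reduce(add, cols_to_sum))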

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using the PySpark sum function, which is an aggregate that takes a single column as input, whereas you are trying to apply it at row level across columns: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
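
To see the intended use of F.sum, here is a minimal aggregate example (a sketch using the question's dataframe): it totals one column over all rows, not across columns within a row.

# F.sum aggregates a single column over rows, e.g. the total of A.p1
df.agg(F.sum(df["`A.p1`"]).alias("total_A_p1")).show()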

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Here, df.select() returns a DataFrame, so you are trying to sum over a DataFrame object rather than over columns. In this case, I think, you would have to iterate row-wise and apply the sum over each row.
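
If you prefer not to rely on Python's built-in sum, a SQL expression gives the same row-wise addition (a sketch; the backticks escape the dotted column names from the question):

df = df.withColumn('sum1', F.expr("`A.p1` + `B.p1`"))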

answered Oct 22 '22 by Suresh

TL;DR builtins.sum is just fine.


Following your comments:

Using native Python sum() is not benefitting from Spark optimization, so what is the Spark way of doing it?

and

it's not a PySpark function, so it won't really be completely benefiting from Spark, right?

I can see you are making incorrect assumptions.

Let's decompose the problem:

[df[col] for col in ["`A.p1`","`B.p1`"]]

creates a list of Columns:

[Column<b'A.p1'>, Column<b'B.p1'>]

Let's call it iterable.

sum reduces this list by taking its elements and calling the __add__ method (+). The imperative equivalent is:

accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element

This gives a Column:

Column<b'(A.p1 + B.p1)'>

which is the same as calling

df["`A.p1`"] + df["`B.p1`"]

No data has been touched, and when evaluated it benefits from all Spark optimizations.
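
As a quick sanity check (a sketch based on the question's dataframe), explain() shows that the projection contains a single (A.p1 + B.p1) expression, so Catalyst treats it like any other column arithmetic:

# The plan contains one arithmetic projection, no driver-side computation
df.withColumn('sum1', sum([df[c] for c in ["`A.p1`", "`B.p1`"]])).explain()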

answered Oct 22 '22 by Alper t. Turker