Add column sum as new column in PySpark dataframe

Tags:

python

apache-spark

pyspark

spark-dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe had columns "a", "b", and "c". I know I can do this:

df.withColumn('total_col', df.a + df.b + df.c)

The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?

445

asked Aug 12 '15 02:08

plam

2 Answers

This was not obvious: I see no row-wise sum of columns defined in the Spark DataFrame API.

Version 2

This can be done in a fairly simple way:

newdf = df.withColumn('total', sum(df[col] for col in df.columns))

df.columns is supplied by PySpark as a list of strings giving the names of all columns in the DataFrame. To sum a different set of columns, supply any other list of column names instead.

I did not try this as my first solution because I wasn't certain how it would behave. But it works.
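The reason the built-in sum works is that it starts from the integer 0 and applies + repeatedly, so the first step is 0 + column, which Python resolves through the column's reflected add method. The sketch below imitates that mechanism without Spark. ToyColumn and _text are hypothetical names that only stand in for a Spark Column; they record the expression as text rather than building a query plan.

```python
class ToyColumn:
    """Hypothetical stand-in for a Spark Column; records the expression as text."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # Handles: column + something
        return ToyColumn(f"({self.expr} + {_text(other)})")

    def __radd__(self, other):
        # Handles: something + column, e.g. the initial 0 that sum() starts from
        return ToyColumn(f"({_text(other)} + {self.expr})")


def _text(value):
    # Render either a ToyColumn or a plain Python value as text
    return value.expr if isinstance(value, ToyColumn) else repr(value)


columns = ["a", "b", "c"]
total = sum(ToyColumn(name) for name in columns)
print(total.expr)  # (((0 + a) + b) + c)
```

One caveat: if the session has run from pyspark.sql.functions import *, the name sum refers to Spark's aggregate function rather than the Python built-in, and the one-liner will not behave this way. The built-in is still reachable as builtins.sum in that case.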

Version 1

This is overly complicated, but works as well.

You can do this:

1. use df.columns to get a list of the names of the columns
2. use that list of names to make a list of the columns
3. pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner

With Python's reduce, some knowledge of how operator overloading works, and the PySpark source code for columns, that becomes:

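from functools import reduce  # a built-in on Python 2; must be imported on Python 3

def column_add(a, b):
    # explicit call to the Column's overloaded + operator
    return a.__add__(b)

newdf = df.withColumn('total_col',
                      reduce(column_add, (df[col] for col in df.columns)))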

Note that this is Python's reduce, not a Spark RDD reduce, and that the second argument to reduce needs its own parentheses because it is a generator expression.

Tested, Works!

$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
...     return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
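The fold in Version 1 can also be written with operator.add from the standard library, which does the same job as the hand-written column_add. The sketch below wraps it in a helper that takes an explicit list of names. sum_columns is a hypothetical name, and a plain dict stands in for the DataFrame so the example runs without Spark; the only assumption about the first argument is that frame[name] returns something that supports +.

```python
from functools import reduce  # Python 3: reduce is no longer a built-in
import operator


def sum_columns(frame, names):
    """Fold the items frame[name] together with +, left to right.

    Hypothetical helper: `frame` only needs to support frame[name].
    """
    return reduce(operator.add, (frame[name] for name in names))


# A plain dict stands in for one row so the sketch runs without Spark.
row = {"a": 8, "b": 5, "c": 6}
print(sum_columns(row, ["a", "b", "c"]))  # 19
print(sum_columns(row, ["a", "c"]))       # 14
```

Since df[name] on a DataFrame returns a Column, a call such as df.withColumn('total', sum_columns(df, ['a', 'c'])) should build the same kind of expression as Version 1, though that call is not exercised here. With an empty list of names, reduce raises a TypeError because no initial value is given.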

179

answered Sep 23 '22 01:09

Paul

The most straightforward way of doing it is to use the expr function:

from pyspark.sql.functions import expr

data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
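To avoid typing the expression by hand, the string passed to expr can be generated from a list of column names. The following is a minimal sketch; build_sum_expr is a hypothetical helper, not part of PySpark, and it assumes that wrapping each name in backticks is enough to quote it (names that themselves contain a backtick are not handled).

```python
def build_sum_expr(names):
    """Return a SQL expression string that adds the named columns.

    Hypothetical helper: each name is wrapped in backticks so that names
    containing spaces or dots are read as a single identifier.
    """
    if not names:
        raise ValueError("need at least one column name")
    return " + ".join(f"`{name}`" for name in names)


print(build_sum_expr(["col1", "col2", "col3", "col4"]))
# `col1` + `col2` + `col3` + `col4`

# Intended use with PySpark (not executed here):
#   from pyspark.sql.functions import expr
#   data = data.withColumn('total', expr(build_sum_expr(data.columns)))
```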

answered Sep 24 '22 01:09

Jonathan
