
How can I sum multiple columns in a spark dataframe in pyspark?

I have a list of column names that I want to sum:

columns = ['col1','col2','col3']

How can I add the three columns together and put the result in a new column? It should work automatically, so that I can change the column list and get new results.

The DataFrame I want as a result:

col1   col2   col3   result
 1      2      3       6
Manrique asked Nov 14 '18 10:11





1 Answer

Try this:

df = df.withColumn('result', sum(df[col] for col in df.columns))

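df.columns is the list of all column names in df, so this sums every column. To sum only the columns from the question, iterate over the columns list instead of df.columns. Note that sum here must be Python's built-in sum, not pyspark.sql.functions.sum, which is an aggregate function.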
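A slightly more explicit variant of the one-liner above, as a minimal sketch: it builds the same col1 + col2 + col3 expression with functools.reduce, so it does not depend on the built-in sum (which is easily shadowed by `from pyspark.sql.functions import *`). The helper name sum_columns_expr is hypothetical, and the commented usage assumes a PySpark DataFrame named df that contains the listed columns.

```python
from functools import reduce
from operator import add


def sum_columns_expr(df, columns):
    """Build one expression that adds df[c] for every name c in columns.

    Only indexing (df[c]) and the + operator are used, so with a PySpark
    DataFrame the result is a Column that can be passed to withColumn.
    """
    return reduce(add, (df[c] for c in columns))


# Assumed usage with an existing PySpark DataFrame `df`:
# columns = ['col1', 'col2', 'col3']
# df = df.withColumn('result', sum_columns_expr(df, columns))
```

Changing the columns list changes which columns are summed. Keep in mind that in Spark, adding a null to a number gives null, so a row with a null in any listed column gets a null result unless the nulls are replaced first, for example with coalesce from pyspark.sql.functions.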

Mayank Porwal answered Oct 22 '22 22:10
