I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three columns and put the result in a new column? (It should be automatic, so that I can change the column list and get new results.)
DataFrame with the result I want:
col1  col2  col3  result
1     2     3     6
Grouping on multiple columns in PySpark is done by passing two or more columns to the groupBy() method. This returns a pyspark.sql.GroupedData object, which provides agg(), sum(), count(), min(), max(), avg(), etc. to perform aggregations.
Method 1: Using the select() method. To get the total of the values in one or more columns, use the sum() function inside select(), passing the column names separated by commas. Here, df is the input PySpark DataFrame and column_name is the column whose values are summed. Note that this sum() is the aggregate function from pyspark.sql.functions: it totals each column over all rows rather than adding columns together row by row.
In PySpark, to add a new constant column to a DataFrame, use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column; to add a NULL/None column, use lit(None).
You can select one or more columns of a Spark DataFrame by passing the column names to the select() function. Since a DataFrame is immutable, this creates a new DataFrame containing only the selected columns. The show() function displays the DataFrame contents.
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
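df.columns is the list of all column names in df. To sum only some of them, iterate over your own list instead (for example, columns from the question). This relies on Python's built-in sum; it will not work if sum has been replaced by pyspark.sql.functions.sum, for example through from pyspark.sql.functions import *.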
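An alternative that does not depend on which sum is in scope is to build a Spark SQL expression from the column list and pass it to pyspark.sql.functions.expr. The following is a minimal sketch, not the answer's own method: the helper names sum_expression and add_sum_column are hypothetical, and it assumes every listed column is numeric.

```python
def sum_expression(columns):
    """Build a Spark SQL expression string that adds the given columns row by row."""
    if not columns:
        raise ValueError("columns must contain at least one column name")
    # Backtick-quote each name so names with spaces or dots still parse;
    # a literal backtick inside a name is escaped by doubling it.
    quoted = ["`" + name.replace("`", "``") + "`" for name in columns]
    return " + ".join(quoted)


def add_sum_column(df, columns, name="result"):
    """Return a new DataFrame with `name` holding the row-wise sum of `columns`."""
    # Imported here so that sum_expression can be used without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(name, F.expr(sum_expression(columns)))
```

With columns = ['col1', 'col2', 'col3'], calling df = add_sum_column(df, columns) adds a result column computed from the three listed columns; changing the list changes the result. As with any + in Spark SQL, a null in one of the columns makes the result null for that row.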