
Pyspark dataframe: Summing over a column while grouping over another

I have a dataframe such as the following

In [94]: prova_df.show()


order_item_order_id order_item_subtotal
1                   299.98             
2                   199.99             
2                   250.0              
2                   129.99             
4                   49.98              
4                   299.95             
4                   150.0              
4                   199.92             
5                   299.98             
5                   299.95             
5                   99.96              
5                   299.98             

What I would like to do is compute, for each distinct value of the first column, the sum of the corresponding values in the second column. I've tried doing this with the following code:

from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()

This gives the following output:

SUM('order_item_subtotal)
129.99000549316406       
579.9500122070312        
199.9499969482422        
634.819995880127         
434.91000747680664 

I'm not so sure it's doing the right thing. Why isn't it also showing the information from the first column? Thanks in advance for your answers.

asked Nov 27 '15 by Paolo Lami


People also ask

How does groupBy on multiple columns work in PySpark?

groupBy returns a single row for each combination of the grouping columns, and an aggregate function is used to compute a value from each group's data. Let's look at a short example of how grouping by multiple columns works.
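
This is only an illustrative sketch; the order_date column and its values are made up purely to provide a second grouping key:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

# Toy data; order_date is a hypothetical extra column used as a second grouping key
df = spark.createDataFrame(
    [(2, "2015-11-27", 199.99),
     (2, "2015-11-27", 250.0),
     (4, "2015-11-28", 49.98),
     (4, "2015-11-28", 299.95)],
    ["order_item_order_id", "order_date", "order_item_subtotal"])

# One output row per (order_item_order_id, order_date) combination
df.groupBy("order_item_order_id", "order_date") \
    .agg(func.sum("order_item_subtotal").alias("order_total")) \
    .show()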

How do you create a column in a PySpark DataFrame?

There are many ways to create a new column in a PySpark DataFrame. The most pysparkish way is to use the built-in functions from pyspark.sql.functions.
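
A minimal sketch of that approach; the subtotal_with_tax column and the 1.2 multiplier are made-up examples:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 299.98), (2, 199.99)],
    ["order_item_order_id", "order_item_subtotal"])

# withColumn plus a built-in function adds a derived column without a Python UDF
df_with_tax = df.withColumn(
    "subtotal_with_tax", func.round(df["order_item_subtotal"] * 1.2, 2))
df_with_tax.show()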

How do you get the per-group sum as a column in a DataFrame?

We can partition the data by the column that holds the group values and then apply the aggregate function sum() over that window to get the sum for each group (partition). Syntax: dataframe.withColumn('New_Column_name', functions.sum('column_name').over(Window.partitionBy('column_name_group'))). A complete example appears in the second answer below.

How do you iterate/loop through the rows of a PySpark DataFrame in Python?

You can collect the PySpark DataFrame to the driver and iterate through it in Python, or use toLocalIterator() to stream the rows. In general, iterating over the rows of a PySpark DataFrame can be done with map(), foreach(), by converting to pandas, or by converting the DataFrame to a Python list.
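
A short sketch of the collect() and toLocalIterator() approaches on toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 299.98), (2, 199.99)],
    ["order_item_order_id", "order_item_subtotal"])

# Option 1: bring all rows to the driver at once (only sensible for small results)
for row in df.collect():
    print(row["order_item_order_id"], row["order_item_subtotal"])

# Option 2: stream rows to the driver one partition at a time
for row in df.toLocalIterator():
    print(row["order_item_order_id"], row["order_item_subtotal"])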




2 Answers

A similar solution for your problem using PySpark would look like this:

df = spark.createDataFrame(
    [(1, 299.98),
    (2, 199.99),
    (2, 250.0),
    (2, 129.99),
    (4, 49.98),
    (4, 299.95),
    (4, 150.0),
    (4, 199.92),
    (5, 299.98),
    (5, 299.95),
    (5, 99.96),
    (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])

df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                  5|       999.8700000000001|
|                  1|                  299.98|
|                  2|                  579.98|
|                  4|                  699.85|
+-------------------+------------------------+
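
If you would rather have an explicit column name than the generated sum(order_item_subtotal), one variation of the same aggregation (a sketch reusing the df created above; order_total is an arbitrary name) is:

from pyspark.sql import functions as func

df.groupBy('order_item_order_id') \
    .agg(func.sum('order_item_subtotal').alias('order_total')) \
    .show()
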
answered Oct 05 '22 by Zac Roberts


You can use partitionBy with a window function for that:

from pyspark.sql import Window
from pyspark.sql import functions as f

df.withColumn("value_field", f.sum("order_item_subtotal")
  .over(Window.partitionBy("order_item_order_id"))) \
  .show()
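
Note that, unlike groupBy, the window version keeps every original row and simply appends the per-group total. If you only want one row per order afterwards, one option (a sketch reusing the df and imports above) is:

df.withColumn("value_field", f.sum("order_item_subtotal")
  .over(Window.partitionBy("order_item_order_id"))) \
  .select("order_item_order_id", "value_field") \
  .dropDuplicates() \
  .show()
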
answered Oct 05 '22 by luminousmen