
Pyspark dataframe: Summing over a column while grouping over another

I have a dataframe such as the following

In [94]: prova_df.show()


order_item_order_id order_item_subtotal
1                   299.98             
2                   199.99             
2                   250.0              
2                   129.99             
4                   49.98              
4                   299.95             
4                   150.0              
4                   199.92             
5                   299.98             
5                   299.95             
5                   99.96              
5                   299.98             

What I would like to do is compute, for each distinct value of the first column, the sum of the corresponding values in the second column. I've tried doing this with the following code:

from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()

This gives the following output:

SUM('order_item_subtotal)
129.99000549316406       
579.9500122070312        
199.9499969482422        
634.819995880127         
434.91000747680664 

I'm not so sure it's doing the right thing. Why isn't it also showing the information from the first column? Thanks in advance for your answers.

asked Nov 27 '15 by Paolo Lami


People also ask

How does groupBy on multiple columns work in PySpark?

groupBy returns a single row for each combination of the grouping columns, and an aggregate function is used to compute a value from each group's data. Let's look at a short example of how grouping by multiple columns works.
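
This is only an illustrative sketch; the order_date column and its values are made up purely to provide a second grouping key:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

# Toy data; order_date is a hypothetical extra column used as a second grouping key
df = spark.createDataFrame(
    [(2, "2015-11-27", 199.99),
     (2, "2015-11-27", 250.0),
     (4, "2015-11-28", 49.98),
     (4, "2015-11-28", 299.95)],
    ["order_item_order_id", "order_date", "order_item_subtotal"])

# One output row per (order_item_order_id, order_date) combination
df.groupBy("order_item_order_id", "order_date") \
    .agg(func.sum("order_item_subtotal").alias("order_total")) \
    .show()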

How do you create a column in a PySpark DataFrame?

There are many ways to create a new column in a PySpark DataFrame. The most pysparkish way is to use the built-in functions from pyspark.sql.functions.
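
A minimal sketch of that approach; the subtotal_with_tax column and the 1.2 multiplier are made-up examples:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 299.98), (2, 199.99)],
    ["order_item_order_id", "order_item_subtotal"])

# withColumn plus a built-in function adds a derived column without a Python UDF
df_with_tax = df.withColumn(
    "subtotal_with_tax", func.round(df["order_item_subtotal"] * 1.2, 2))
df_with_tax.show()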

How do you get the per-group sum as a column in a DataFrame?

We can partition the data by the column that holds the group values and then apply the aggregate function sum() over that window to get the sum for each group (partition). Syntax: dataframe.withColumn('New_Column_name', functions.sum('column_name').over(Window.partitionBy('column_name_group'))). A complete example appears in the second answer below.

How do you iterate/loop through the rows of a PySpark DataFrame in Python?

You can collect the PySpark DataFrame to the driver and iterate through it in Python, or use toLocalIterator() to stream the rows. In general, iterating over the rows of a PySpark DataFrame can be done with map(), foreach(), by converting to pandas, or by converting the DataFrame to a Python list.
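
A short sketch of the collect() and toLocalIterator() approaches on toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 299.98), (2, 199.99)],
    ["order_item_order_id", "order_item_subtotal"])

# Option 1: bring all rows to the driver at once (only sensible for small results)
for row in df.collect():
    print(row["order_item_order_id"], row["order_item_subtotal"])

# Option 2: stream rows to the driver one partition at a time
for row in df.toLocalIterator():
    print(row["order_item_order_id"], row["order_item_subtotal"])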




2 Answers

A similar solution for your problem using PySpark would look like this:

df = spark.createDataFrame(
    [(1, 299.98),
    (2, 199.99),
    (2, 250.0),
    (2, 129.99),
    (4, 49.98),
    (4, 299.95),
    (4, 150.0),
    (4, 199.92),
    (5, 299.98),
    (5, 299.95),
    (5, 99.96),
    (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])

df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                  5|       999.8700000000001|
|                  1|                  299.98|
|                  2|                  579.98|
|                  4|                  699.85|
+-------------------+------------------------+
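
If you would rather have an explicit column name than the generated sum(order_item_subtotal), one variation of the same aggregation (a sketch reusing the df created above; order_total is an arbitrary name) is:

from pyspark.sql import functions as func

df.groupBy('order_item_order_id') \
    .agg(func.sum('order_item_subtotal').alias('order_total')) \
    .show()
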
answered Oct 05 '22 by Zac Roberts


You can use partitionBy with a window function for that:

from pyspark.sql import Window
from pyspark.sql import functions as f

df.withColumn("value_field", f.sum("order_item_subtotal")
  .over(Window.partitionBy("order_item_order_id"))) \
  .show()
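
Note that, unlike groupBy, the window version keeps every original row and simply appends the per-group total. If you only want one row per order afterwards, one option (a sketch reusing the df and imports above) is:

df.withColumn("value_field", f.sum("order_item_subtotal")
  .over(Window.partitionBy("order_item_order_id"))) \
  .select("order_item_order_id", "value_field") \
  .dropDuplicates() \
  .show()
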
answered Oct 05 '22 by luminousmen