I have a dataframe such as the following
In [94]: prova_df.show()
order_item_order_id order_item_subtotal
1                   299.98
2                   199.99
2                   250.0
2                   129.99
4                   49.98
4                   299.95
4                   150.0
4                   199.92
5                   299.98
5                   299.95
5                   99.96
5                   299.98
What I would like to do is to compute, for each different value of the first column, the sum over the corresponding values of the second column. I've tried doing this with the following code:
from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()
Which gives the following output:
SUM('order_item_subtotal)
129.99000549316406
579.9500122070312
199.9499969482422
634.819995880127
434.91000747680664
I'm not sure this is doing the right thing. Why isn't it also showing the information from the first column? Thanks in advance for your answers.
Group By returns a single row for each distinct value (or combination of values) of the grouping column(s), and an aggregate function such as sum computes a value from each group's rows, which is exactly what you want here. If you would rather keep every row and simply add the group total as a new column, you can instead use the built-in functions over a window partitioned by the grouping column.
Syntax: dataframe.withColumn('New_Column_name', functions.sum('column_name').over(Window.partitionBy('column_name_group')))
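For example, here is a minimal sketch of the grouped version, reusing the column names from the question (the alias 'order_total' is only an illustrative name, not something from the original code):

from pyspark.sql import functions as func

# one output row per distinct order id; alias() gives the aggregate a readable column name
prova_df.groupBy("order_item_order_id") \
    .agg(func.sum("order_item_subtotal").alias("order_total")) \
    .show()

In current Spark versions the grouping column is kept in the result of agg automatically, and the alias avoids the unreadable SUM('order_item_subtotal) label.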
A similar, self-contained solution for your problem (written against Python 2.7.x) would look like this:
df = spark.createDataFrame(
    [(1, 299.98),
     (2, 199.99),
     (2, 250.0),
     (2, 129.99),
     (4, 49.98),
     (4, 299.95),
     (4, 150.0),
     (4, 199.92),
     (5, 299.98),
     (5, 299.95),
     (5, 99.96),
     (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])
df.groupBy('order_item_order_id').sum('order_item_subtotal').show()
Which results in the following output:
+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
| 5| 999.8700000000001|
| 1| 299.98|
| 2| 579.98|
| 4| 699.85|
+-------------------+------------------------+
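The trailing ...0000000001 on the order 5 total is just ordinary floating-point rounding in the double-typed sum, not a wrong result. If you want cleaner numbers and a friendlier column name, you can round and alias the aggregate; a small sketch (the name 'order_total' is only illustrative):

from pyspark.sql import functions as func

# round the per-group sum to 2 decimals and give the column a readable name
df.groupBy('order_item_order_id') \
    .agg(func.round(func.sum('order_item_subtotal'), 2).alias('order_total')) \
    .show()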
You can also use a window function partitioned by the order id for that; unlike groupBy, this keeps every row and adds the per-order sum as a new column:

from pyspark.sql import functions as f
from pyspark.sql import Window

df.withColumn("value_field",
              f.sum("order_item_subtotal").over(Window.partitionBy("order_item_order_id"))) \
  .show()
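For the sample data above this should keep all twelve rows and add the per-order total in value_field, roughly like this (row order within the output may differ, and the order 5 total may print with a little floating-point noise, e.g. 999.8700000000001, just like in the grouped output):

+-------------------+-------------------+-----------+
|order_item_order_id|order_item_subtotal|value_field|
+-------------------+-------------------+-----------+
|                  1|             299.98|     299.98|
|                  2|             199.99|     579.98|
|                  2|              250.0|     579.98|
|                  2|             129.99|     579.98|
|                  4|              49.98|     699.85|
|                  4|             299.95|     699.85|
|                  4|              150.0|     699.85|
|                  4|             199.92|     699.85|
|                  5|             299.98|     999.87|
|                  5|             299.95|     999.87|
|                  5|              99.96|     999.87|
|                  5|             299.98|     999.87|
+-------------------+-------------------+-----------+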