I want to divide the sum of two columns in PySpark. For example, I have a dataset like the one below:
   A  B  C
1  1  2  3
2  1  2  3
3  1  2  3
What I want is the sum of column B divided by the sum of column A:
6 (sum of B) / 3 (sum of A) = 2
I have tried this:
sumofA = df.groupby().sum('A')
sumofB = df.groupby().sum('B')
Result = sumofB / sumofA
but it produces this error:
TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'
Your approach is close, but you can do the division inside a single aggregation instead.
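First, a minimal DataFrame matching the sample in the question (an assumed construction, just so the snippets below are runnable):
from pyspark.sql import SparkSession

# Build the example dataset from the question
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3), (1, 2, 3), (1, 2, 3)], ["A", "B", "C"])
With that in place, the whole calculation is one aggregation: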
from pyspark.sql import functions as F
df.groupBy().agg(F.sum("B")/F.sum("A")).show()
+-----------------+
|(sum(B) / sum(A))|
+-----------------+
| 2.0|
+-----------------+
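If you want a cleaner column name than the auto-generated (sum(B) / sum(A)), you can alias the aggregate expression (a minimal sketch; the name "ratio" is just an example):
from pyspark.sql import functions as F

# Give the computed column an explicit name instead of the default
df.groupBy().agg((F.sum("B") / F.sum("A")).alias("ratio")).show()
+-----+
|ratio|
+-----+
|  2.0|
+-----+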
Or you can collect the result as a Python value with collect()[0][0]:
from pyspark.sql import functions as F
a=df.groupBy().agg(F.sum("B")/F.sum("A")).collect()[0][0]
a
Out[5]: 2.0
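For completeness, the two-step approach from the question can also be made to work: each aggregation returns a single-row DataFrame, so you need to pull the scalar out of each one before dividing (a sketch assuming the same df as above):
from pyspark.sql import functions as F

# first() returns the single Row; [0] extracts the sum as a plain Python number
sum_b = df.agg(F.sum("B")).first()[0]
sum_a = df.agg(F.sum("A")).first()[0]
result = sum_b / sum_a  # 2.0
This also explains the original TypeError: the / operator is not defined between two DataFrames, only between column expressions or plain numbers.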