I have a scenario with a DataFrame that looks something like this:
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
I want to calculate and return the number of people per area in descending order, but most importantly I'm struggling to calculate the overall percentage.
Result should look like this:
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
See the code sample below:
import pyspark.sql.functions as fn

HouseDf = spark.createDataFrame(
    [("1", "House", "200"),
     ("1", "Flat", "100"),
     ("2", "House", "300"),
     ("2", "Flat", "400"),
     ("3", "House", "1000"),
     ("4", "Flat", "250")],
    ["Area", "Type", "NrPeople"])

Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))

Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))

Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql import Window

# Window spanning the entire result set, so the grand total can be
# computed alongside every row
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('total', fn.sum(fn.col('SumPeople')).over(window))\
    .withColumn('Percent', fn.col('SumPeople') * 100 / fn.col('total'))\
    .drop('total')\
    .show()
Output:
+----+---------+------------------+
|Area|SumPeople| Percent|
+----+---------+------------------+
| 3| 1000.0| 44.44444444444444|
| 2| 700.0| 31.11111111111111|
| 1| 300.0|13.333333333333334|
| 4| 250.0| 11.11111111111111|
+----+---------+------------------+
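As a quick sanity check (no Spark required), the same grouping and percentage arithmetic in plain Python reproduces the values above:

```python
from collections import defaultdict

rows = [("1", "House", 200), ("1", "Flat", 100),
        ("2", "House", 300), ("2", "Flat", 400),
        ("3", "House", 1000), ("4", "Flat", 250)]

# Sum people per area
per_area = defaultdict(int)
for area, _type, people in rows:
    per_area[area] += people

# Express each area's sum as a share of the grand total,
# sorted by SumPeople descending
total = sum(per_area.values())
result = sorted(per_area.items(), key=lambda kv: kv[1], reverse=True)
for area, people in result:
    print(area, people, round(people * 100 / total, 2))
```

The grand total here is 2250, so e.g. area 3 gets 1000 * 100 / 2250 ≈ 44.44, matching the window-function output.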