I have a scenario with a DataFrame that looks something like this:
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
I want to calculate and return the number of people per area in descending order, but most importantly I'm struggling to calculate the overall percentage.
Result should look like this:
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
See the code sample below:
import pyspark.sql.functions as fn

HouseDf = spark.createDataFrame(
    [("1", "House", "200"),
     ("1", "Flat", "100"),
     ("2", "House", "300"),
     ("2", "Flat", "400"),
     ("3", "House", "1000"),
     ("4", "Flat", "250")],
    ["Area", "Type", "NrPeople"])

Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))

Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))

Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql import Window

# Window spanning the entire result set, so the grand total can be
# computed alongside every row
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('total', fn.sum(fn.col('SumPeople')).over(window))\
    .withColumn('Percent', fn.col('SumPeople') * 100 / fn.col('total'))\
    .drop('total')\
    .show()
Output:
+----+---------+------------------+
|Area|SumPeople| Percent|
+----+---------+------------------+
| 3| 1000.0| 44.44444444444444|
| 2| 700.0| 31.11111111111111|
| 1| 300.0|13.333333333333334|
| 4| 250.0| 11.11111111111111|
+----+---------+------------------+
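As a quick sanity check (no Spark required), the same grouping and percentage arithmetic in plain Python reproduces the values above:

```python
from collections import defaultdict

rows = [("1", "House", 200), ("1", "Flat", 100),
        ("2", "House", 300), ("2", "Flat", 400),
        ("3", "House", 1000), ("4", "Flat", 250)]

# Sum people per area
per_area = defaultdict(int)
for area, _type, people in rows:
    per_area[area] += people

# Express each area's sum as a share of the grand total,
# sorted by SumPeople descending
total = sum(per_area.values())
result = sorted(per_area.items(), key=lambda kv: kv[1], reverse=True)
for area, people in result:
    print(area, people, round(people * 100 / total, 2))
```

The grand total here is 2250, so e.g. area 3 gets 1000 * 100 / 2250 ≈ 44.44, matching the window-function output.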