How get the percentage of totals for each count after a groupBy in pyspark?

Question

Given the following DataFrame:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([['a',1],['b', 2],['a', 3]], ['category', 'value'])
df.show()


+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       b|    2|
|       a|    3|
+--------+-----+

I want to count the number of items in each category and provide a percentage of total for each count, like so

+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
|       b|    1|     0.333|
|       a|    2|     0.667|
+--------+-----+----------+

RubenLaguna · Accepted Answer

You can obtain the count and percentage/ratio of totals with the following

import pyspark.sql.functions as f
from pyspark.sql.window import Window
df.groupBy('category').count()\
  .withColumn('percentage', f.round(f.col('count') / f.sum('count')\
  .over(Window.partitionBy()),3)).show()

+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
|       b|    1|     0.333|
|       a|    2|     0.667|
+--------+-----+----------+

The previous statement can be divided in steps. df.groupBy('category').count() produces the count:

+--------+-----+
|category|count|
+--------+-----+
|       b|    1|
|       a|    2|
+--------+-----+

then by applying window functions we can obtain the total count on each row:

df.groupBy('category').count().withColumn('total', f.sum('count').over(Window.partitionBy())).show()

+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
|       b|    1|    3|
|       a|    2|    3|
+--------+-----+-----+

where the total column is calculated by adding together all the counts in the partition (a single partition that includes all rows).

Once we have count and total for each row we can calculate the ratio:

df.groupBy('category')\
  .count()\
  .withColumn('total', f.sum('count').over(Window.partitionBy()))\
  .withColumn('percentage',f.col('count')/f.col('total'))\
  .show()

+--------+-----+-----+------------------+
|category|count|total|        percentage|
+--------+-----+-----+------------------+
|       b|    1|    3|0.3333333333333333|
|       a|    2|    3|0.6666666666666666|
+--------+-----+-----+------------------+

Mykola Zotko · Answer

You can groupby and aggregate with agg:

import pyspark.sql.functions as F

df.groupby('category').agg(F.count('value') / df.count()).show()

Output:

+--------+------------------+
|category|(count(value) / 3)|
+--------+------------------+
|       b|0.3333333333333333|
|       a|0.6666666666666666|
+--------+------------------+

To make it nicer you can use:

df.groupby('category').agg(
    (
        F.round(F.count('value') / df.count(), 2)
    ).alias('ratio')
).show()

Output:

+--------+-----+
|category|ratio|
+--------+-----+
|       b| 0.33|
|       a| 0.67|
+--------+-----+

You can also use SQL:

df.createOrReplaceTempView('df')

spark.sql(
    """
    SELECT category, COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
    FROM df
    GROUP BY category
    """
).show()

How get the percentage of totals for each count after a groupBy in pyspark?

Tags:

pyspark

RubenLaguna

2 Answers

RubenLaguna

Mykola Zotko

Recent Activity

Donate For Us

How get the percentage of totals for each count after a groupBy in pyspark?

Tags:

pyspark

RubenLaguna

2 Answers

RubenLaguna

Mykola Zotko

Related questions

Recent Activity

Donate For Us