
How to get the percentage of totals for each count after a groupBy in pyspark?

Tags:

pyspark

Given the following DataFrame:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([['a',1],['b', 2],['a', 3]], ['category', 'value'])
df.show()


+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       b|    2|
|       a|    3|
+--------+-----+

I want to count the number of items in each category and provide each count's percentage of the total, like so:

+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
|       b|    1|     0.333|
|       a|    2|     0.667|
+--------+-----+----------+
asked Sep 10 '18 by RubenLaguna

2 Answers

You can obtain the count and the percentage/ratio of the total with the following:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df.groupBy('category').count()\
  .withColumn('percentage',
              f.round(f.col('count') / f.sum('count').over(Window.partitionBy()), 3))\
  .show()

+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
|       b|    1|     0.333|
|       a|    2|     0.667|
+--------+-----+----------+

The previous statement can be divided into steps. df.groupBy('category').count() produces the per-category count:

+--------+-----+
|category|count|
+--------+-----+
|       b|    1|
|       a|    2|
+--------+-----+

Then, by applying a window function, we can obtain the total count on each row:

df.groupBy('category').count().withColumn('total', f.sum('count').over(Window.partitionBy())).show()

+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
|       b|    1|    3|
|       a|    2|    3|
+--------+-----+-----+

Here the total column is calculated by adding together all the counts in the window partition; Window.partitionBy() with no arguments creates a single partition that includes every row.
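If you prefer to avoid a window function, a rough equivalent (my own sketch, not part of the original answer) is to aggregate the grand total separately and cross-join it back, which yields the same category/count/total columns:

counts = df.groupBy('category').count()
# single-row DataFrame holding the grand total of all counts
totals = counts.agg(f.sum('count').alias('total'))
counts.crossJoin(totals).show()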

Once we have count and total on each row, we can calculate the ratio:

df.groupBy('category')\
  .count()\
  .withColumn('total', f.sum('count').over(Window.partitionBy()))\
  .withColumn('percentage',f.col('count')/f.col('total'))\
  .show()

+--------+-----+-----+------------------+
|category|count|total|        percentage|
+--------+-----+-----+------------------+
|       b|    1|    3|0.3333333333333333|
|       a|    2|    3|0.6666666666666666|
+--------+-----+-----+------------------+
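To match the output asked for in the question exactly, you can round the ratio and drop the helper column as a final step, continuing the same chain (with f and Window imported as above):

df.groupBy('category')\
  .count()\
  .withColumn('total', f.sum('count').over(Window.partitionBy()))\
  .withColumn('percentage', f.round(f.col('count') / f.col('total'), 3))\
  .drop('total')\
  .show()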
answered by RubenLaguna


You can groupby and aggregate with agg:

import pyspark.sql.functions as F

df.groupby('category').agg(F.count('value') / df.count()).show()

Output:

+--------+------------------+
|category|(count(value) / 3)|
+--------+------------------+
|       b|0.3333333333333333|
|       a|0.6666666666666666|
+--------+------------------+

To make the output nicer, you can round the result and give the column an alias:

df.groupby('category').agg(
    (
        F.round(F.count('value') / df.count(), 2)
    ).alias('ratio')
).show()

Output:

+--------+-----+
|category|ratio|
+--------+-----+
|       b| 0.33|
|       a| 0.67|
+--------+-----+
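If you need the total in more than one place, you might compute it once up front instead of calling df.count() inside each expression; a minor variation on the same idea (column names are illustrative):

import pyspark.sql.functions as F

total_rows = df.count()  # grand total of rows, computed once as a separate action

df.groupby('category').agg(
    F.count('value').alias('count'),
    F.round(F.count('value') / total_rows, 2).alias('ratio')
).show()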

You can also use SQL:

df.createOrReplaceTempView('df')

spark.sql(
    """
    SELECT category, COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
    FROM df
    GROUP BY category
    """
).show()
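A window function works in SQL as well, so the total can be computed without a subquery; a sketch, assuming the same temp view df and that your Spark version supports window functions over grouped aggregates:

spark.sql(
    """
    SELECT category,
           ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 3) AS ratio
    FROM df
    GROUP BY category
    """
).show()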
answered by Mykola Zotko