Given the following DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([['a',1],['b', 2],['a', 3]], ['category', 'value'])
df.show()
+--------+-----+
|category|value|
+--------+-----+
| a| 1|
| b| 2|
| a| 3|
+--------+-----+
I want to count the number of items in each category and provide a percentage of total for each count, like so
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
You can obtain the count and percentage/ratio of totals with the following
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df.groupBy('category').count()\
.withColumn('percentage', f.round(f.col('count') / f.sum('count')\
.over(Window.partitionBy()),3)).show()
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
The previous statement can be divided in steps. df.groupBy('category').count()
produces the count
:
+--------+-----+
|category|count|
+--------+-----+
| b| 1|
| a| 2|
+--------+-----+
then by applying window functions we can obtain the total count on each row:
df.groupBy('category').count().withColumn('total', f.sum('count').over(Window.partitionBy())).show()
+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
| b| 1| 3|
| a| 2| 3|
+--------+-----+-----+
where the total
column is calculated by adding together all the counts in the partition (a single partition that includes all rows).
Once we have count
and total
for each row we can calculate the ratio:
df.groupBy('category')\
.count()\
.withColumn('total', f.sum('count').over(Window.partitionBy()))\
.withColumn('percentage',f.col('count')/f.col('total'))\
.show()
+--------+-----+-----+------------------+
|category|count|total| percentage|
+--------+-----+-----+------------------+
| b| 1| 3|0.3333333333333333|
| a| 2| 3|0.6666666666666666|
+--------+-----+-----+------------------+
You can groupby
and aggregate with agg
:
import pyspark.sql.functions as F
df.groupby('category').agg(F.count('value') / df.count()).show()
Output:
+--------+------------------+
|category|(count(value) / 3)|
+--------+------------------+
| b|0.3333333333333333|
| a|0.6666666666666666|
+--------+------------------+
To make it nicer you can use:
df.groupby('category').agg(
(
F.round(F.count('value') / df.count(), 2)
).alias('ratio')
).show()
Output:
+--------+-----+
|category|ratio|
+--------+-----+
| b| 0.33|
| a| 0.67|
+--------+-----+
You can also use SQL:
df.createOrReplaceTempView('df')
spark.sql(
"""
SELECT category, COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
FROM df
GROUP BY category
"""
).show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With