Calculate the mode of a PySpark DataFrame column?

Question

Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or map the columns of the DataFrame to an RDD of vectors (something I'm also having trouble doing) and use colStats from MLlib. But I don't see mode as an option there.

zero323 · Accepted Answer

The problem with the mode is much the same as with the median: it is simple to define, but the computation is rather expensive. It can be done either with a sort followed by local and global aggregations, or with a word-count-style aggregation followed by a filter:

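import numpy as np
from pyspark.sql import functions as F

np.random.seed(1)

# `sc` is the SparkContext provided by the PySpark shell
df = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

# Count occurrences of each value, then keep one row whose count equals the maximum
cnts = df.groupBy("x").count()
mode = cnts.join(
    cnts.agg(F.max("count").alias("max_")), F.col("count") == F.col("max_")
).limit(1).select("x")
mode.first()[0]
## 0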

Either way it may require a full shuffle for each column.
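
To make the cost more concrete, here is a small plain-Python sketch (no Spark involved) of what the count-and-filter approach does conceptually: each partition counts its own values, the partial counts are merged by key (the step that needs a shuffle in Spark), and the values tied for the maximum count are kept. The partition layout and the function name modes_from_partitions are made up for illustration; this is not how Spark implements it internally.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def modes_from_partitions(partitions: Iterable[Iterable[Hashable]]) -> List[Hashable]:
    """Return every value tied for the highest count across all partitions."""
    # Local aggregation: count values inside each partition independently.
    local_counts = [Counter(part) for part in partitions]

    # Global aggregation: merge the partial counts by key.
    # In Spark this merge is where data moves between executors.
    total = Counter()
    for counts in local_counts:
        total.update(counts)

    if not total:
        return []

    # Filter: keep the values whose count equals the maximum.
    # sorted() assumes the values are mutually comparable.
    max_count = max(total.values())
    return sorted(value for value, n in total.items() if n == max_count)


# Hypothetical data split into three partitions.
partitions = [[1, 2, 2], [3, 3, 2], [3, 1]]
print(modes_from_partitions(partitions))  # [2, 3]
```

Here 2 and 3 both occur three times, so both are returned. The Spark snippet above keeps only one of the tied values because of limit(1).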

Tesia · Answer

This line gives you the mode of the column "col" in the Spark DataFrame df:

df.groupby("col").count().orderBy("count", ascending=False).first()[0]

For a list of modes for all columns in df use:

[df.groupby(i).count().orderBy("count", ascending=False).first()[0] for i in df.columns]

To label each mode with its column name, build a list of [name, mode] pairs:

[[i, df.groupby(i).count().orderBy("count", ascending=False).first()[0]] for i in df.columns]

Tags: python, apache-spark, apache-spark-sql, pyspark

RKD314
