The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() DataFrame methods to get the distinct count of a PySpark DataFrame. Another way is to use the SQL function countDistinct(), which returns the distinct count for one or more selected columns.
For counting the number of distinct rows we use distinct().count(), which returns the number of distinct rows in the DataFrame and can be stored in a variable named 'row'. For counting the number of columns we use len(df.columns).
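A minimal sketch of both counts, assuming a SparkSession is already running and df is an existing DataFrame:

row = df.distinct().count()  # number of distinct rows in df
col = len(df.columns)        # number of columns in df
print(row, col)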
Distinct values of a column in PySpark are obtained by using the select() function along with the distinct() function. select() takes one or more column names as arguments, and the distinct() call that follows returns the distinct values of those columns combined.
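For example, assuming df has columns named "state" and "city" (hypothetical names), the combined distinct values of the two columns can be pulled like this:

df.select("state", "city").distinct().show()   # distinct (state, city) pairs
df.select("state", "city").distinct().count()  # number of distinct pairs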
distinct() runs over all columns; if you want a distinct count on selected columns only, use the Spark SQL function countDistinct(), which returns the number of distinct elements in a group.
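A short sketch of countDistinct() on selected columns, again using the hypothetical "state" and "city" columns:

from pyspark.sql.functions import countDistinct
df.agg(countDistinct("state", "city").alias("distinct_state_city")).show()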
In PySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
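If you want the per-column counts back on the driver as a plain Python dict, one possible follow-up is to collect the single result row:

counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).first().asDict()
print(counts)  # e.g. {'col_a': 3, 'col_b': 17, ...}  (hypothetical column names)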
Similarly in Scala:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col

df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct() (called approx_count_distinct in recent Spark versions).
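A sketch of the approximate variant in PySpark, using approx_count_distinct with its optional rsd parameter (the maximum allowed relative standard deviation of the estimate):

from pyspark.sql.functions import approx_count_distinct, col

df.agg(*(approx_count_distinct(col(c), rsd=0.05).alias(c) for c in df.columns)).show()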