Spark DataFrame: count distinct values of every column

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?

The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.

Rami asked Nov 30 '16 12:11



1 Answer

In PySpark you could do something like this, using countDistinct():

    from pyspark.sql.functions import col, countDistinct

    # One aggregate expression per column, each aliased to the column's name
    df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))

Similarly in Scala:

    import org.apache.spark.sql.functions.countDistinct
    import org.apache.spark.sql.functions.col

    df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

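If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct() (named approx_count_distinct() since Spark 2.1), which returns an estimate rather than an exact count.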
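To make the shape of the result concrete, here is a minimal plain-Python sketch (no Spark involved) of what a per-column distinct count computes: one number per column. It assumes rows are dicts and that None stands in for SQL NULL, which a COUNT(DISTINCT ...) style aggregate does not count. The helper name distinct_counts and the sample rows are hypothetical, made up for this illustration.

```python
# Plain-Python model of per-column distinct counts (not Spark code).
# Assumption: each row is a dict, and None plays the role of SQL NULL.

def distinct_counts(rows, columns):
    """Return {column: number of distinct non-null values in that column}."""
    seen = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is not None:  # NULLs are skipped, as in COUNT(DISTINCT ...)
                seen[c].add(value)
    return {c: len(values) for c, values in seen.items()}


# Hypothetical sample data: three rows, one of them with a missing city.
rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Oslo"},
    {"id": 3, "city": None},
]

result = distinct_counts(rows, ["id", "city"])
print(result)  # {'id': 3, 'city': 1}
```

Note that selecting a column and calling distinct().count() on it would count the null as one more distinct row, so the two approaches can differ by one for columns that contain nulls.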

mtoto answered Sep 19 '22 21:09