
How to count occurrences of each distinct value for every column in a dataframe?

edf.select("x").distinct.show() shows the distinct values present in column x of the edf DataFrame.

Is there an efficient way to also show how many times each of these distinct values occurs in the DataFrame (a count per distinct value)?

Leothorn asked Jun 21 '16 16:06





2 Answers

To count how many distinct values a column has, countDistinct is probably the first choice:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))

If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
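
Since the question asks about every column, both aggregations can be mapped over df.columns so that all columns are handled in a single pass. This is a minimal sketch, assuming Spark 2.1 or later and a DataFrame with at least one column; the object name DistinctPerColumn, the helper names exact and approx, and the sample data are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{approx_count_distinct, col, countDistinct}

object DistinctPerColumn {
  // Exact number of distinct non-null values per column, in one aggregation.
  def exact(df: DataFrame): DataFrame = {
    val aggs = df.columns.map(c => countDistinct(col(c)).alias(c))
    df.agg(aggs.head, aggs.tail: _*)
  }

  // Approximate variant; rsd is the maximum relative standard deviation allowed.
  def approx(df: DataFrame, rsd: Double = 0.05): DataFrame = {
    val aggs = df.columns.map(c => approx_count_distinct(col(c), rsd).alias(c))
    df.agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("distinct-per-column")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the asker's edf.
    val edf = Seq(("a", 1), ("b", 1), ("a", 2)).toDF("x", "y")

    exact(edf).show()
    approx(edf).show()

    spark.stop()
  }
}
```

Both helpers return a one-row DataFrame with one result per input column. Note that col(c) treats a dot in a column name as struct field access, so such names would need backtick quoting.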

To get each distinct value together with its number of occurrences:

df.groupBy("some_column").count() 
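
To cover every column, as the question title asks, the same groupBy can be repeated per column name. The following is a rough sketch rather than a tuned solution: it runs one aggregation per column, and the object name ValueCounts, the helper valueCounts, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object ValueCounts {
  // Occurrences of each distinct value in one column, most frequent first.
  // Unlike countDistinct, groupBy keeps null as its own group.
  def valueCounts(df: DataFrame, column: String): DataFrame =
    df.groupBy(col(column)).count().orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("value-counts")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the asker's edf.
    val edf = Seq(("a", 1), ("b", 1), ("a", 2)).toDF("x", "y")

    // One Spark job per column; each result has the column plus "count".
    edf.columns.foreach(c => valueCounts(edf, c).show())

    spark.stop()
  }
}
```

For wide tables this launches many jobs, so caching the DataFrame first may help. A column that is itself named count would clash with the generated count column and would need renaming beforehand.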

In SQL (spark-sql):

SELECT COUNT(DISTINCT some_column) FROM df 

and

SELECT approx_count_distinct(some_column) FROM df 
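
The SQL statements above assume the DataFrame has been registered under the name df. A sketch of the SQL counterpart of the groupBy version, run from Scala, might look like this; the object name SqlValueCounts, the alias cnt, and the sample data are assumptions made for illustration, and createOrReplaceTempView requires Spark 2.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlValueCounts {
  // Registers df as the temporary view "df" and counts rows per distinct value.
  def valueCounts(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("df") // Spark 1.x used registerTempTable instead
    spark.sql(
      """SELECT some_column, COUNT(*) AS cnt
        |FROM df
        |GROUP BY some_column
        |ORDER BY cnt DESC""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-value-counts")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with the column name used in the answer.
    val df = Seq("a", "b", "a").toDF("some_column")

    valueCounts(spark, df).show()

    spark.stop()
  }
}
```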
zero323 answered Sep 30 '22 20:09



Roughly speaking, how it works:

[Two images in the original answer; their content is not available in this text.]

Saurav Sahu answered Sep 30 '22 18:09
