The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() DataFrame methods to get the distinct count of a PySpark DataFrame. Another way is to use the SQL function countDistinct(), which returns the distinct count for one or more selected columns.
For counting the number of distinct rows we use distinct().count(), which returns the number of distinct rows in the DataFrame and can be stored in a variable named 'row'. For counting the number of columns we use len(df.columns).
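A minimal sketch of both counts, assuming a SparkSession is already running and df is an existing DataFrame:

row = df.distinct().count()  # number of distinct rows in df
col = len(df.columns)        # number of columns in df
print(row, col)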
Distinct values of a column in PySpark are obtained by using the select() function along with the distinct() function. select() takes one or more column names as arguments, and the distinct() call that follows returns the distinct values of those columns combined.
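For example, assuming df has columns named "state" and "city" (hypothetical names), the combined distinct values of the two columns can be pulled like this:

df.select("state", "city").distinct().show()   # distinct (state, city) pairs
df.select("state", "city").distinct().count()  # number of distinct pairs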
distinct() runs over all columns; if you want a distinct count on selected columns only, use the Spark SQL function countDistinct(), which returns the number of distinct elements in a group.
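A short sketch of countDistinct() on selected columns, again using the hypothetical "state" and "city" columns:

from pyspark.sql.functions import countDistinct
df.agg(countDistinct("state", "city").alias("distinct_state_city")).show()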
In PySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
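If you want the per-column counts back on the driver as a plain Python dict, one possible follow-up is to collect the single result row:

counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).first().asDict()
print(counts)  # e.g. {'col_a': 3, 'col_b': 17, ...}  (hypothetical column names)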
Similarly in Scala:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col

df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct() (called approx_count_distinct in recent Spark versions).
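A sketch of the approximate variant in PySpark, using approx_count_distinct with its optional rsd parameter (the maximum allowed relative standard deviation of the estimate):

from pyspark.sql.functions import approx_count_distinct, col

df.agg(*(approx_count_distinct(col(c), rsd=0.05).alias(c) for c in df.columns)).show()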