edf.select("x").distinct.show()
shows the distinct values that are present in the x column of the edf DataFrame.
Is there an efficient method to also show the number of times each of these distinct values occurs in the DataFrame (a count for each distinct value)?
countDistinct is probably the first choice:
import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))
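For context, here is a minimal, self-contained sketch of that first option; the SparkSession setup and the sample values are illustrative, not from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data with three distinct values in some_column
val df = Seq("a", "b", "a", "c").toDF("some_column")

// Returns a single-row DataFrame holding the distinct count (3 here)
df.agg(countDistinct("some_column")).show()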
If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):
import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
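If you go the approximate route, approx_count_distinct also accepts an optional maximum estimation error (relative standard deviation, default 0.05); the 0.01 below is just an illustrative tolerance:

import org.apache.spark.sql.functions.approx_count_distinct

// Second argument caps the relative standard deviation of the estimate;
// tighter tolerances use more memory
df.agg(approx_count_distinct("some_column", 0.01)).show()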
To get values and counts:
df.groupBy("some_column").count()
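The count column produced by groupBy(...).count() is literally named count, so a common follow-up is to sort by it. A small sketch, assuming the same df as above:

import org.apache.spark.sql.functions.desc

// Most frequent values first
df.groupBy("some_column")
  .count()
  .orderBy(desc("count"))
  .show()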
In SQL (spark-sql):
SELECT COUNT(DISTINCT some_column) FROM df
and
SELECT approx_count_distinct(some_column) FROM df
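Note that the SQL variants assume a table or view named df exists. With the DataFrame API you would first register one (the view name here is only an illustration), and a GROUP BY then gives the per-value counts the question asks for:

// Hypothetical view name; any identifier works
df.createOrReplaceTempView("df")

spark.sql("SELECT some_column, COUNT(*) AS cnt FROM df GROUP BY some_column").show()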