How it is possible to calculate the number of unique elements in each column of a pyspark dataframe:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
print(df_spark.show())
# +----+----+
# |col1|col2|
# +----+----+
# | 1| 100|
# | 1| 200|
# | 2| 300|
# | 3| 100|
# | 4| 100|
# | 4| 300|
# +----+----+
# Some transformations on df_spark here
# How to get a number of unique elements (just a number) in each columns?
I know only the following solution which is very slow, both of these lines are calculated in the same amount of time:
col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()
There are about 10 millions rows in df_spark
.
In Pyspark, there are two ways to get the count of distinct values. We can use distinct() and count() functions of DataFrame to get the count distinct of PySpark DataFrame. Another way is to use SQL countDistinct() function which will provide the distinct value count of all the selected columns.
Distinct value of the column in pyspark is obtained by using select() function along with distinct() function. select() function takes up mutiple column names as argument, Followed by distinct() function will give distinct value of those columns combined.
In this article, we are going to display the distinct column values from dataframe using pyspark in Python. For this, we are using distinct() and dropDuplicates() functions along with select() function.
Spark doesn't have a distinct method that takes columns that should run distinct on however, Spark provides another signature of dropDuplicates() function which takes multiple columns to eliminate duplicates. Note that calling dropDuplicates() on DataFrame returns a new DataFrame with duplicate rows removed.
Try this:
from pyspark.sql.functions import col, countDistinct
df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))
EDIT:
As @pault suggested, its an expensive operation and you can use approx_count_distinct()
The one he suggested is currently deprecated (spark version >= 2.1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With