Number of unique elements in all columns of a PySpark DataFrame [duplicate]

How is it possible to calculate the number of unique elements in each column of a PySpark DataFrame?

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()
# +----+----+
# |col1|col2|
# +----+----+
# |   1| 100|
# |   1| 200|
# |   2| 300|
# |   3| 100|
# |   4| 100|
# |   4| 300|
# +----+----+

# Some transformations on df_spark here

# How to get the number of unique elements (just a number) in each column?

The only solution I know is the following, which is very slow; each of these two lines takes roughly the same (long) time to compute:

col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()

There are about 10 million rows in df_spark.

asked Dec 13 '18 by Konstantin

People also ask

How do you count the number of unique values in a column in PySpark?

In PySpark, there are two ways to get the count of distinct values. One is to chain the DataFrame's distinct() and count() functions. The other is to use the SQL-style countDistinct() function, which returns the distinct-value count over the selected columns.
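For example, both approaches applied to the df_spark from the question (a minimal sketch; the counts in the comments refer to that example data):

from pyspark.sql.functions import countDistinct

# Way 1: distinct() followed by count()
col1_distinct = df_spark.select('col1').distinct().count()  # 4 (values 1, 2, 3, 4)

# Way 2: SQL-style countDistinct() aggregate
col1_distinct_sql = df_spark.select(countDistinct('col1')).collect()[0][0]  # also 4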

How do I get distinct values for all columns in PySpark?

Distinct values of a column in PySpark are obtained by using the select() function along with the distinct() function. select() accepts multiple column names as arguments; applying distinct() afterwards gives the distinct values of those columns combined.
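A minimal sketch with the df_spark from the question. Note that distinct() over two selected columns deduplicates whole (col1, col2) tuples, not each column separately:

# Distinct combinations of col1 and col2 -- all six rows in the
# example data are distinct pairs, so all six are returned
df_spark.select('col1', 'col2').distinct().show()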

How do you find unique values in PySpark?

To display the distinct values of a DataFrame column in PySpark, use the distinct() or dropDuplicates() functions together with the select() function.
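A minimal sketch of both functions on a single column of the question's df_spark (row order in the output is not guaranteed):

# distinct() keeps the unique values of the selected column
df_spark.select('col2').distinct().show()
# +----+
# |col2|
# +----+
# | 300|
# | 100|
# | 200|
# +----+

# dropDuplicates() on a single selected column is equivalent here
df_spark.select('col2').dropDuplicates().show()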

How do I use distinct on multiple columns in PySpark?

Spark's distinct() does not take a list of columns to deduplicate on; however, dropDuplicates() has a signature that accepts multiple columns for eliminating duplicates. Note that calling dropDuplicates() on a DataFrame returns a new DataFrame with the duplicate rows removed.
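For example, deduplicating the question's df_spark on col1 only (a sketch; which of the duplicate rows survives is not deterministic):

# Keep one row per distinct col1 value; the col2 value comes from
# whichever duplicate row Spark happens to keep
df_spark.dropDuplicates(['col1']).show()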


1 Answer

Try this:

from pyspark.sql.functions import col, countDistinct

# One pass over the data: a countDistinct aggregate per column
df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns)).show()

EDIT: As @pault pointed out, countDistinct() is an expensive operation, and you can use approx_count_distinct() instead. The variant he originally suggested, approxCountDistinct(), is deprecated (Spark version >= 2.1).
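A minimal sketch of the approximate version, mirroring the aggregate above (rsd is the maximum allowed relative standard deviation of the estimate; 0.05 is the default):

from pyspark.sql.functions import approx_count_distinct, col

# HyperLogLog++-based estimate: much cheaper than countDistinct()
# on ~10 million rows, at the cost of a small relative error
df_spark.agg(*(
    approx_count_distinct(col(c), rsd=0.05).alias(c)
    for c in df_spark.columns
)).show()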

answered Sep 21 '22 by Manrique