Number of unique elements in all columns of a pyspark dataframe [duplicate]


How can I calculate the number of unique elements in each column of a PySpark DataFrame?

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()  # show() prints the table itself and returns None
# +----+----+
# |col1|col2|
# +----+----+
# |   1| 100|
# |   1| 200|
# |   2| 300|
# |   3| 100|
# |   4| 100|
# |   4| 300|
# +----+----+

# Some transformations on df_spark here

# How do I get the number of unique elements (just a number) in each column?

The only solution I know is the following, which is very slow. Each of these lines takes about the same amount of time to run:

col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()

There are about 10 million rows in df_spark.


asked Dec 13 '18 13:12

Konstantin

1 Answer

Try this:

from pyspark.sql.functions import col, countDistinct

df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))

EDIT: As @pault suggested, this is an expensive operation, and you can use approx_count_distinct() instead if an approximate result is acceptable. Note that the function he originally suggested is deprecated (Spark version >= 2.1).

answered Sep 21 '22 00:09

Manrique

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark
