Is there a simple and efficient way to check a PySpark DataFrame for duplicates (without dropping them) based on one or more columns?
I want to check whether a DataFrame has duplicates based on a combination of columns and, if it does, fail the process.
TIA.
➠ Find complete row duplicates: use groupBy on all of the DataFrame's columns together with the count() aggregate, then filter for groups whose count is greater than 1.
➠ Find column-level duplicates: use groupBy on just the required columns together with count(), then filter to get the duplicate records.
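A minimal sketch of both patterns, assuming a DataFrame named df and illustrative key columns column1 and column2:

from pyspark.sql import functions as F

# Complete-row duplicates: group on every column and keep groups seen more than once
full_row_dups = (
    df.groupBy(df.columns)
      .count()
      .filter(F.col('count') > 1)
)

# Column-level duplicates: group only on the columns that should be unique
key_dups = (
    df.groupBy('column1', 'column2')
      .count()
      .filter(F.col('count') > 1)
)

full_row_dups.show()
key_dups.show()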
In PySpark, there are two ways to get the count of distinct values. You can call distinct() followed by count() on the DataFrame, or you can use the countDistinct() function from pyspark.sql.functions, which returns the distinct count over the selected columns. If the distinct count over the key columns is smaller than the total row count, the DataFrame contains duplicates.
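A sketch of that idea applied to the duplicate check, again assuming df and illustrative key columns column1 and column2:

from pyspark.sql import functions as F

total_rows = df.count()
# distinct() keeps NULL key combinations in the count
distinct_keys = df.select('column1', 'column2').distinct().count()

# Alternative aggregate form (note: countDistinct ignores rows where a key column is NULL)
distinct_keys_agg = df.agg(F.countDistinct('column1', 'column2')).first()[0]

if distinct_keys < total_rows:
    raise ValueError('Data has duplicates')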
The easiest way would be to check if the number of rows in the dataframe equals the number of rows after dropping duplicates.
if df.count() > df.dropDuplicates(listOfColumns).count():
    raise ValueError('Data has duplicates')
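For context, here is a self-contained sketch of that check with made-up sample data (the SparkSession setup and column names are purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with one duplicated (id, country) combination
df = spark.createDataFrame(
    [(1, 'US'), (2, 'DE'), (2, 'DE'), (3, 'FR')],
    ['id', 'country'],
)

listOfColumns = ['id', 'country']
# Raises ValueError because (2, 'DE') appears twice
if df.count() > df.dropDuplicates(listOfColumns).count():
    raise ValueError('Data has duplicates')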
If you also want to actually inspect the duplicates, you can do
df \
    .groupby(['column1', 'column2']) \
    .count() \
    .where('count > 1') \
    .sort('count', ascending=False) \
    .show()
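If you need the full duplicated rows rather than just the key counts, one option (a sketch reusing the same illustrative column names) is to join the duplicated keys back to the original DataFrame:

dup_keys = (
    df.groupby(['column1', 'column2'])
      .count()
      .where('count > 1')
      .drop('count')
)

# Inner join keeps only the rows whose key combination appears more than once
df.join(dup_keys, on=['column1', 'column2'], how='inner').show()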