Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.
Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, then there are duplicates and the code runs.
```python
if len(df['Student'].unique()) < len(df.index):
    # Code to remove duplicates based on Date column runs
```
Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?
Here is some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):

```
  Student           Date
0     Joe  December 2017
1   James   January 2018
2     Bob     April 2018
3     Joe  December 2017
4    Jack  February 2018
5    Jack     March 2018
```
To find duplicates on a specific column, we can simply call the duplicated() method on the column. The result is a boolean Series where the value True denotes a duplicate. In other words, True means the entry is identical to a previous one.
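As a minimal sketch (using a small frame built from the question's sample data):

```python
import pandas as pd

# Illustrative frame based on the question's sample data
df = pd.DataFrame({'Student': ['Joe', 'James', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'January 2018', 'April 2018', 'December 2017']})

# duplicated() marks each entry that repeats an earlier one
mask = df['Student'].duplicated()
print(mask.tolist())  # [False, False, False, True] - the second Joe repeats the first
print(mask.any())     # True -> the column contains duplicates
```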
Is there a duplicate value in a column, True/False?
```
╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝
```
Assuming the above dataframe (df), we could do a quick check for duplicates in the Student column by:

```python
boolean = not df["Student"].is_unique       # True (credit to @Carsten)
boolean = df['Student'].duplicated().any()  # True
```
Above we are using one of the pandas Series methods. The pandas DataFrame has several useful methods, two of which are duplicated and drop_duplicates. These methods can be applied to the DataFrame as a whole, and not just to a Series (column) as above. The equivalent would be:
```python
boolean = df.duplicated(subset=['Student']).any()  # True
# We were expecting True, as Joe can be seen twice.
```
However, if we are interested in the whole frame we could go ahead and do:
```python
boolean = df.duplicated().any()                           # False
boolean = df.duplicated(subset=['Student', 'Date']).any() # False
# We were expecting False here - no duplicates row-wise,
# i.e. Joe Dec 2017 and Joe Dec 2018 are distinct rows.
```
And a final useful tip. By using the keep parameter we can normally skip a few rows, directly accessing what we need:

keep : {'first', 'last', False}, default 'first'
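Since the question wants to keep the row with the oldest date per student, one hedged sketch (assuming the Date strings parse with a month-name-plus-year format) is to sort chronologically first and then let keep='first' do the work:

```python
import pandas as pd

# Hypothetical frame mirroring the question's data
df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2018', 'April 2018', 'December 2017']})

# Parse the dates so they sort chronologically
df['Date'] = pd.to_datetime(df['Date'], format='%B %Y')

# Sort oldest-first, then keep the first (i.e. oldest) row per Student
oldest = (df.sort_values('Date')
            .drop_duplicates(subset=['Student'], keep='first'))
print(oldest)
```

With keep='last' instead, the newest row per student would survive.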
```python
import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n')  # True

# Approach 2: First store boolean array, check, then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
```
Returns

```
True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018
```
You can use is_unique:

```python
df['Student'].is_unique  # equals True in case of no duplicates
```
Older pandas versions required:

```python
pd.Series(df['Student']).is_unique
```
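A short usage sketch, guarding the de-duplication step exactly as the question asks (the frame here is illustrative):

```python
import pandas as pd

# Illustrative frame; Joe appears twice in Student
df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'April 2018', 'December 2018']})

print(df['Student'].is_unique)  # False - Joe repeats
print(df['Date'].is_unique)     # True - every date is distinct

if not df['Student'].is_unique:
    # Only now run the (potentially expensive) de-duplication step
    df = df.drop_duplicates(subset=['Student'])
```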