 

Check for duplicate values in Pandas dataframe column

Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.

Currently I compare the number of unique values in the column to the number of rows: if there are less unique values than rows then there are duplicates and the code runs.

    if len(df['Student'].unique()) < len(df.index):
        # Code to remove duplicates based on Date column runs

Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?

Some of the sample data I am working with (only two columns shown). If duplicates are found then another function identifies which row to keep (row with oldest date):

       Student  Date
    0  Joe      December 2017
    1  James    January 2018
    2  Bob      April 2018
    3  Joe      December 2017
    4  Jack     February 2018
    5  Jack     March 2018
asked May 08 '18 by Jeff Mitchell



2 Answers

Main question

Is there a duplicate value in a column, True/False?

    ╔═════════╦═══════════════╗
    ║ Student ║ Date          ║
    ╠═════════╬═══════════════╣
    ║ Joe     ║ December 2017 ║
    ╠═════════╬═══════════════╣
    ║ Bob     ║ April 2018    ║
    ╠═════════╬═══════════════╣
    ║ Joe     ║ December 2018 ║
    ╚═════════╩═══════════════╝

Given the dataframe above (df), we can quickly check whether the Student column contains duplicates with either of:

    boolean = not df["Student"].is_unique      # True (credit to @Carsten)
    boolean = df['Student'].duplicated().any() # True

Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, not just to a single Series (column) as above. The equivalent would be:

    boolean = df.duplicated(subset=['Student']).any() # True
    # We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

    boolean = df.duplicated().any()                          # False
    boolean = df.duplicated(subset=['Student','Date']).any() # False
    # We were expecting False here - no duplicates row-wise
    # ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often skip a few steps and directly access the rows we need:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
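To illustrate how the three keep options differ, here is a minimal sketch using the same three-row frame as above (only the flagged boolean masks change):

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data))

# keep='first' (default): only later occurrences are flagged
print(df.duplicated(subset=['Student']).tolist())
# [False, False, True]

# keep='last': only earlier occurrences are flagged
print(df.duplicated(subset=['Student'], keep='last').tolist())
# [True, False, False]

# keep=False: every member of a duplicated group is flagged
print(df.duplicated(subset=['Student'], keep=False).tolist())
# [True, False, True]
```

Passing the mask to df.loc[] (or its negation) then selects exactly the rows you want to keep or inspect.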

Example to play around with

    import pandas as pd
    import io

    data = '''\
    Student,Date
    Joe,December 2017
    Bob,April 2018
    Joe,December 2018'''

    df = pd.read_csv(io.StringIO(data), sep=',')

    # Approach 1: Simple True/False
    boolean = df.duplicated(subset=['Student']).any()
    print(boolean, end='\n\n') # True

    # Approach 2: First store boolean array, check, then remove
    duplicate_in_student = df.duplicated(subset=['Student'])
    if duplicate_in_student.any():
        print(df.loc[~duplicate_in_student], end='\n\n')

    # Approach 3: Use drop_duplicates method
    df.drop_duplicates(subset=['Student'], inplace=True)
    print(df)

Returns

    True

      Student           Date
    0     Joe  December 2017
    1     Bob     April 2018

      Student           Date
    0     Joe  December 2017
    1     Bob     April 2018
answered Sep 17 '22 by Anton vBR


You can use is_unique:

    df['Student'].is_unique  # equals True in case of no duplicates

Older pandas versions required:

pd.Series(df['Student']).is_unique 
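Tying this back to the question, a minimal sketch of guarding the dedup step with is_unique (the frame below is hypothetical sample data, and keep='first' stands in for the asker's oldest-date logic):

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Joe', 'Bob', 'Joe'],
    'Date': ['December 2017', 'April 2018', 'December 2018'],
})

if not df['Student'].is_unique:
    # Only runs when 'Student' actually contains duplicates
    df = df.drop_duplicates(subset=['Student'], keep='first')

print(df)
```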
answered Sep 17 '22 by Carsten