Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.
Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, then there are duplicates and the code runs.
```python
if len(df['Student'].unique()) < len(df.index):
    # Code to remove duplicates based on Date column runs
```
Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?
Here is some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):

```
  Student           Date
0     Joe  December 2017
1   James   January 2018
2     Bob     April 2018
3     Joe  December 2017
4    Jack  February 2018
5    Jack     March 2018
```
To find duplicates on a specific column, we can simply call the duplicated() method on the column. The result is a boolean Series where the value True denotes a duplicate. In other words, True means the entry is identical to a previous one.
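As a minimal sketch (using a small frame built from the question's sample data):

```python
import pandas as pd

# Illustrative frame based on the question's sample data
df = pd.DataFrame({'Student': ['Joe', 'James', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'January 2018', 'April 2018', 'December 2017']})

# duplicated() marks each entry that repeats an earlier one
mask = df['Student'].duplicated()
print(mask.tolist())  # [False, False, False, True] - the second Joe repeats the first
print(mask.any())     # True -> the column contains duplicates
```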
Is there a duplicate value in a column, True/False?
```
╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝
```
Assuming the above dataframe (df), we could do a quick check for duplicates in the Student column by:

```python
boolean = not df["Student"].is_unique       # True (credit to @Carsten)
boolean = df['Student'].duplicated().any()  # True
```
Above we are using one of the pandas Series methods. The pandas DataFrame has several useful methods, two of which are duplicated and drop_duplicates. These methods can be applied to the DataFrame as a whole, and not just to a Series (column) as above. The equivalent would be:
```python
boolean = df.duplicated(subset=['Student']).any()  # True
# We were expecting True, as Joe can be seen twice.
```
However, if we are interested in the whole frame we could go ahead and do:
```python
boolean = df.duplicated().any()                           # False
boolean = df.duplicated(subset=['Student', 'Date']).any() # False
# We were expecting False here - no duplicates row-wise,
# i.e. Joe Dec 2017 and Joe Dec 2018 are distinct rows.
```
And a final useful tip. By using the keep parameter we can normally skip a few rows, directly accessing what we need:

keep : {'first', 'last', False}, default 'first'
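Since the question wants to keep the row with the oldest date per student, one hedged sketch (assuming the Date strings parse with a month-name-plus-year format) is to sort chronologically first and then let keep='first' do the work:

```python
import pandas as pd

# Hypothetical frame mirroring the question's data
df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2018', 'April 2018', 'December 2017']})

# Parse the dates so they sort chronologically
df['Date'] = pd.to_datetime(df['Date'], format='%B %Y')

# Sort oldest-first, then keep the first (i.e. oldest) row per Student
oldest = (df.sort_values('Date')
            .drop_duplicates(subset=['Student'], keep='first'))
print(oldest)
```

With keep='last' instead, the newest row per student would survive.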
```python
import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n')  # True

# Approach 2: First store boolean array, check, then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
```
Returns

```
True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018
```
You can use is_unique:

```python
df['Student'].is_unique  # equals True in case of no duplicates
```
Older pandas versions required:

```python
pd.Series(df['Student']).is_unique
```
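A short usage sketch, guarding the de-duplication step exactly as the question asks (the frame here is illustrative):

```python
import pandas as pd

# Illustrative frame; Joe appears twice in Student
df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'April 2018', 'December 2018']})

print(df['Student'].is_unique)  # False - Joe repeats
print(df['Date'].is_unique)     # True - every date is distinct

if not df['Student'].is_unique:
    # Only now run the (potentially expensive) de-duplication step
    df = df.drop_duplicates(subset=['Student'])
```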