Find duplicates with groupby in Pandas

Tags: python, pandas

I read a CSV file using Pandas. Then I check whether there are any duplicate rows in the data using the code below:

import pandas as pd

df = pd.read_csv("data.csv", na_values=["", " ", "-"])

print(df.shape)
>> (71644, 15)

print(df.drop_duplicates().shape)
>> (31171, 15)

I find that there are some duplicate rows, so I want to see which rows appear more than once:

data_groups = df.groupby(df.columns.tolist())  # group on every column
size = data_groups.size()                      # rows per unique value combination
size[size > 1]                                 # combinations occurring more than once

Doing that I get Series([], dtype: int64).
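A likely explanation (my note, illustrated on a hypothetical toy frame rather than the original data): by default, groupby silently drops any row whose grouping key contains NaN, so duplicate rows that contain NaN never form a group at all:

import numpy as np
import pandas as pd

# Hypothetical toy frame: the only duplicated rows contain NaN.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                    "b": ["x", "y", "y"]})

print(toy.duplicated().sum())
>> 1

size = toy.groupby(toy.columns.tolist()).size()
print(size[size > 1])
>> Series([], dtype: int64)

If every duplicated row in the real data carries at least one NaN, the groupby result would be empty exactly as observed.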

Furthermore, I can find the duplicate rows with the following:

duplicates = df[(df.duplicated() == True)]

print(duplicates.shape)
>> (40473, 15)

So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows, but groupby doesn't.
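A related sketch (my addition, using only standard pandas API): duplicated(keep=False) marks every copy of a duplicated row, not just the second and later ones, which makes it easy to pull the full duplicate groups out for inspection:

# keep=False flags all copies of a duplicated row, not only the later ones.
all_copies = df[df.duplicated(keep=False)]

# Sorting on every column lines the duplicate groups up next to each other.
print(all_copies.sort_values(df.columns.tolist()).head())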

My data consists of strings, integers, floats, and NaN values.

Have I misunderstood something in the functions I mention above, or is something else happening?

Asked Oct 19 '15 by IordanouGiannis




1 Answer

Simply add reset_index() to realign the aggregates into a new DataFrame.

Additionally, size() returns its counts in an unnamed column labeled 0, which you can use to filter for duplicate rows. Then take the length of the resulting DataFrame to get a duplicate count, just as with the other approaches: drop_duplicates() and duplicated() == True.

data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()

size[size[0] > 1]        # DataFrame of duplicate rows

len(size[size[0] > 1])   # number of duplicates
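As a follow-up sketch (my addition; assumes pandas >= 1.1 for the dropna keyword): passing dropna=False keeps the NaN-keyed groups that groupby drops by default, and reset_index(name=...) gives the count column a real name instead of the positional 0:

# Assumes pandas >= 1.1 for the dropna keyword on groupby.
counts = (df.groupby(df.columns.tolist(), dropna=False)
            .size()
            .reset_index(name="n"))

counts[counts["n"] > 1]        # duplicate groups, including NaN-keyed rows
len(counts[counts["n"] > 1])   # number of distinct duplicated rows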
Answered Sep 25 '22 by Parfait