Find duplicates with groupby in Pandas

Tags: python, pandas

I read a CSV file using Pandas. Then I check whether there are any duplicate rows in the data using the code below:

import pandas as pd

df = pd.read_csv("data.csv", na_values=["", " ", "-"])

print(df.shape)
>> (71644, 15)

print(df.drop_duplicates().shape)
>> (31171, 15)

I find that there are some duplicate rows, so I want to see which rows appear more than once:

data_groups = df.groupby(df.columns.tolist())  # group on every column
size = data_groups.size()                      # rows per unique value combination
size[size > 1]                                 # combinations occurring more than once

Doing that I get Series([], dtype: int64).
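A likely explanation (my note, illustrated on a hypothetical toy frame rather than the original data): by default, groupby silently drops any row whose grouping key contains NaN, so duplicate rows that contain NaN never form a group at all:

import numpy as np
import pandas as pd

# Hypothetical toy frame: the only duplicated rows contain NaN.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                    "b": ["x", "y", "y"]})

print(toy.duplicated().sum())
>> 1

size = toy.groupby(toy.columns.tolist()).size()
print(size[size > 1])
>> Series([], dtype: int64)

If every duplicated row in the real data carries at least one NaN, the groupby result would be empty exactly as observed.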

Furthermore, I can find the duplicate rows with the following:

duplicates = df[(df.duplicated() == True)]

print(duplicates.shape)
>> (40473, 15)

So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows, but groupby doesn't.
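A related sketch (my addition, using only standard pandas API): duplicated(keep=False) marks every copy of a duplicated row, not just the second and later ones, which makes it easy to pull the full duplicate groups out for inspection:

# keep=False flags all copies of a duplicated row, not only the later ones.
all_copies = df[df.duplicated(keep=False)]

# Sorting on every column lines the duplicate groups up next to each other.
print(all_copies.sort_values(df.columns.tolist()).head())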

My data consists of strings, integers, floats, and NaN values.

Have I misunderstood something in the functions I mention above, or is something else happening?

Asked Oct 19 '15 by IordanouGiannis




1 Answer

Simply add reset_index() to realign the aggregates into a new DataFrame.

Additionally, size() returns its counts in an unnamed column labeled 0, which you can use to filter for duplicate rows. Then take the length of the resulting DataFrame to get a duplicate count, just as with the other approaches: drop_duplicates() and duplicated() == True.

data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()

size[size[0] > 1]        # DataFrame of duplicate rows

len(size[size[0] > 1])   # number of duplicates
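As a follow-up sketch (my addition; assumes pandas >= 1.1 for the dropna keyword): passing dropna=False keeps the NaN-keyed groups that groupby drops by default, and reset_index(name=...) gives the count column a real name instead of the positional 0:

# Assumes pandas >= 1.1 for the dropna keyword on groupby.
counts = (df.groupby(df.columns.tolist(), dropna=False)
            .size()
            .reset_index(name="n"))

counts[counts["n"] > 1]        # duplicate groups, including NaN-keyed rows
len(counts[counts["n"] > 1])   # number of distinct duplicated rows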
Answered Sep 25 '22 by Parfait