I read a CSV file using Pandas. Then I check whether there are any duplicate rows in the data using the code below:
import pandas as pd
df = pd.read_csv("data.csv", na_values=["", " ", "-"])
print(df.shape)
>> (71644, 15)
print(df.drop_duplicates().shape)
>> (31171, 15)
I find that there are some duplicate rows, so I want to see which rows appear more than once:
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size()
size[size > 1]
Doing that, I get an empty result: Series([], dtype: int64).
Furthermore, I can find the duplicate rows with the following:
duplicates = df[(df.duplicated() == True)]
print(duplicates.shape)
>> (40473, 15)
So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows, but groupby doesn't.
My data consists of strings, integers, floats, and NaN.
Have I misunderstood something in the functions mentioned above, or is something else going on?
Pandas' groupby() splits your data into groups keyed by one or more columns, so you can run computations per group; it accepts a list of columns and you can apply single or multiple aggregations at once.
The catch here is that, by default, groupby() silently drops any row whose group key contains NaN. Since your duplicate rows contain missing values, size() never sees them and returns an empty Series. In pandas 1.1+ you can pass dropna=False to keep those rows.
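A minimal sketch, using a made-up two-column frame, of how NaN in the grouping keys hides duplicate rows by default, and how dropna=False (pandas >= 1.1) reveals them:

```python
import numpy as np
import pandas as pd

# made-up frame: rows 0 and 1 are exact duplicates, with NaN in column "b"
df = pd.DataFrame({"a": [1, 1, 2], "b": [np.nan, np.nan, 3.0]})

sizes = df.groupby(["a", "b"]).size()
print(sizes[sizes > 1])  # empty: rows with NaN keys are silently dropped

sizes = df.groupby(["a", "b"], dropna=False).size()  # pandas >= 1.1
print(sizes[sizes > 1])  # the duplicated (1, NaN) group now shows up with size 2
```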
Then add reset_index() to turn the group sizes back into a DataFrame; size() produces an unnamed 0 column that you can filter on to find the duplicate rows. The length of the resulting frame gives a duplicate count, consistent with the other approaches: drop_duplicates() and duplicated() == True.
data_groups = df.groupby(df.columns.tolist(), dropna=False)  # dropna=False needs pandas >= 1.1
size = data_groups.size().reset_index()
size[size[0] > 1]       # DATAFRAME OF DUPLICATED ROWS
len(size[size[0] > 1])  # NUMBER OF DUPLICATED GROUPS
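If the goal is simply to inspect the repeated rows themselves, no groupby is needed: duplicated(keep=False) marks every member of a duplicate group, not just the later repeats, and it treats NaNs as equal. A short sketch with a made-up frame standing in for df:

```python
import numpy as np
import pandas as pd

# made-up frame: rows 0 and 1 are exact duplicates, with a NaN value
df = pd.DataFrame({"a": [1, 1, 2], "b": [np.nan, np.nan, 3.0]})

# keep=False flags every row that has at least one identical twin
all_dupes = df[df.duplicated(keep=False)]
print(all_dupes)               # rows 0 and 1

# default keep="first" counts only the repeats, excluding first occurrences
print(df.duplicated().sum())   # 1
```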