DataFrame.drop_duplicates and DataFrame.drop not removing rows

I have read a CSV into a pandas DataFrame, and it has five columns. Certain rows have duplicate values only in the second column. I want to remove these rows from the DataFrame, but neither drop nor drop_duplicates is working.

Here is my implementation:

import pandas as pd

# Read CSV
df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])

print(df.b)

dropRows = []
# Sanitize the data to get rid of duplicates
for indx, val in enumerate(df.b):  # for all the values
    if indx == 0:  # skip first index
        continue

    if val == df.b[indx-1]:  # this is a duplicate rtc value
        dropRows.append(indx)

print(dropRows)

df.drop(dropRows)  # this doesn't work
df.drop_duplicates('b')  # this doesn't work either

print(df.b)

When I print the series df.b before and after, they are the same length and I can still see the duplicates. Is there something wrong with my implementation?

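Why is drop_duplicates not working in Python?

If a date column is stored with the pandas object dtype (for example, as strings), values that represent the same date may not compare equal, so drop_duplicates may not treat them as duplicates. Convert the column with pd.to_datetime first.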
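As a minimal sketch of that situation, assume a hypothetical frame whose `date` column holds the same calendar day written in two string formats. The column and variable names are invented for illustration, and the strings are parsed one at a time with pd.to_datetime so the mixed formats do not matter.

```python
import pandas as pd

# Hypothetical data: the same day written two different ways.
dates = pd.DataFrame({'date': ['2020-01-05', '2020/01/05'], 'value': [1, 1]})

# As strings the two rows differ, so nothing is dropped.
as_strings = dates.drop_duplicates()

# Parse each string on its own, then deduplicate on real timestamps.
parsed = dates.copy()
parsed['date'] = parsed['date'].apply(pd.to_datetime)
as_dates = parsed.drop_duplicates()

print(len(as_strings), len(as_dates))
```

Once both values are timestamps they compare equal, so only one row should remain.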

How do you drop rows from a data frame?

To drop a row or column from a DataFrame, use its drop() method. By default, rows are labelled with integer index values starting at 0, and columns are labelled by name.
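A short sketch of drop() on a made-up frame (the data and variable names are illustrative only). Each call returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

frame = pd.DataFrame({'a': [10, 20, 30], 'b': ['x', 'y', 'z']})

# Drop a single row by its index label.
without_row = frame.drop(1)

# Drop several rows at once by passing a list of labels.
without_rows = frame.drop([0, 2])

# Drop a column by name.
without_col = frame.drop(columns=['b'])

print(without_row)
print(without_rows)
print(without_col)
```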

Does pandas automatically remove duplicates?

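pandas does not remove duplicates unless you call drop_duplicates(). By default, drop_duplicates() removes rows that are duplicated across all columns and keeps the first occurrence. To compare only specific column(s), pass subset. To keep the last occurrence instead of the first, pass keep='last'.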
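The difference between the defaults, subset, and keep can be seen on a small made-up frame (names and values are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['x', 'x', 'y', 'y']})

# No row is identical across all columns, so nothing is removed.
all_columns = frame.drop_duplicates()

# Compare only column 'b'; the first row of each group is kept.
keep_first = frame.drop_duplicates(subset='b')

# Same comparison, but keep the last row of each group.
keep_last = frame.drop_duplicates(subset='b', keep='last')

print(all_columns)
print(keep_first)
print(keep_last)
```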

How do I get rid of duplicates in pandas?

To remove every row that has a duplicate, including the first occurrence, set keep=False in drop_duplicates(). For example, df.drop_duplicates(keep=False).
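A small sketch of keep=False on invented data: the two identical rows are both removed, and only the row that never repeats survives.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# keep=False discards every member of a duplicated group.
unique_only = frame.drop_duplicates(keep=False)

print(unique_only)
```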

In my case the issue was that I was concatenating dfs with columns of different types:

import pandas as pd

s1 = pd.DataFrame([['a', 1]], columns=['letter', 'code'])
s2 = pd.DataFrame([['a', '1']], columns=['letter', 'code'])
df = pd.concat([s1, s2])
df = df.reset_index(drop=True)
df.drop_duplicates(inplace=True)

# 2 rows
print(df)

# int
print(type(df.at[0, 'code']))
# string
print(type(df.at[1, 'code']))

# Fix:
df['code'] = df['code'].astype(str)
df.drop_duplicates(inplace=True)

# 1 row
print(df)

As mentioned in the comments, drop and drop_duplicates return a new DataFrame unless they are called with inplace=True. Any of these options would work:

df = df.drop(dropRows)
df = df.drop_duplicates('b')
df.drop(dropRows, inplace = True)
df.drop_duplicates('b', inplace = True)
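The following sketch uses invented data to show why the result has to be assigned back. It also shows a vectorized alternative to the question's loop, under the assumption that the goal is to remove rows whose `b` value repeats the row directly above; the shift-based mask is a swapped-in technique, not the asker's code. Note that this differs from drop_duplicates('b'), which removes any repeated value, not only consecutive ones.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [7, 7, 8, 7, 7]})

# The returned frame is discarded, so df is unchanged.
df.drop([1, 4])
length_after_discarded_drop = len(df)

# Mark rows whose 'b' equals the previous row's 'b' (consecutive repeats).
is_repeat = df['b'] == df['b'].shift()
consecutive_removed = df[~is_repeat]

# drop_duplicates removes every later occurrence, consecutive or not.
any_repeat_removed = df.drop_duplicates(subset='b')

print(consecutive_removed)
print(any_repeat_removed)
```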

Tags:

python

pandas

user3123955

2 Answers

johnecon

Korem
