Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove duplicate rows in pandas dataframe based on condition

Tags:

python

pandas

            is_avail   valu data_source
2015-08-07     False  0.282    source_a
2015-08-07     False  0.582    source_b
2015-08-23     False  0.296    source_a
2015-09-08     False  0.433    source_a
2015-10-01      True  0.169    source_b

In the dataframe above, I want to remove the duplicate rows (i.e. row where the index is repeated) by retaining the row with a higher value in the valu column.

I can remove rows with duplicate indexes like this:

df = df[~df.index.duplicated()]. But how to remove based on condition specified above?

like image 467
user308827 Avatar asked May 05 '17 22:05

user308827


People also ask

How do you drop duplicates in Pandas with conditions?

Remove All Duplicate Rows from Pandas DataFrame You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) . Yields below output.

How do I remove duplicates based on condition?

To remove duplicate values, click Data > Data Tools > Remove Duplicates. To highlight unique or duplicate values, use the Conditional Formatting command in the Style group on the Home tab.

What is the method used for removing duplicate rows in your DataFrame?

drop_duplicates method to remove them. In the above code, we call . drop_duplicates() on the kitch_prod_df DataFrame with the inplace argument set to True. This allows us to modify the existing DataFrame instead of returning a new copy of the DataFrame with the duplicates removed.


2 Answers

You can use groupby on index after sorting the df by valu.

df.sort_values(by='valu', ascending=False).groupby(level=0).first()
Out[1277]: 
           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b
like image 90
Allen Avatar answered Sep 21 '22 03:09

Allen


Using drop_duplicates with keep='last'

df.rename_axis('date').reset_index() \
    .sort_values(['date', 'valu']) \
    .drop_duplicates('date', keep='last') \
    .set_index('date').rename_axis(df.index.name)

           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b
like image 36
piRSquared Avatar answered Sep 21 '22 03:09

piRSquared