is_avail valu data_source
2015-08-07 False 0.282 source_a
2015-08-07 False 0.582 source_b
2015-08-23 False 0.296 source_a
2015-09-08 False 0.433 source_a
2015-10-01 True 0.169 source_b
In the dataframe above, I want to remove the duplicate rows (i.e. row where the index is repeated) by retaining the row with a higher value in the valu
column.
I can remove rows with duplicate indexes like this:
df = df[~df.index.duplicated()]
. But how to remove based on condition specified above?
Remove All Duplicate Rows from Pandas DataFrame You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) . Yields below output.
To remove duplicate values, click Data > Data Tools > Remove Duplicates. To highlight unique or duplicate values, use the Conditional Formatting command in the Style group on the Home tab.
drop_duplicates method to remove them. In the above code, we call . drop_duplicates() on the kitch_prod_df DataFrame with the inplace argument set to True. This allows us to modify the existing DataFrame instead of returning a new copy of the DataFrame with the duplicates removed.
You can use groupby on index after sorting the df by valu.
df.sort_values(by='valu', ascending=False).groupby(level=0).first()
Out[1277]:
is_avail valu data_source
2015-08-07 False 0.582 source_b
2015-08-23 False 0.296 source_a
2015-09-08 False 0.433 source_a
2015-10-01 True 0.169 source_b
Using drop_duplicates
with keep='last'
df.rename_axis('date').reset_index() \
.sort_values(['date', 'valu']) \
.drop_duplicates('date', keep='last') \
.set_index('date').rename_axis(df.index.name)
is_avail valu data_source
2015-08-07 False 0.582 source_b
2015-08-23 False 0.296 source_a
2015-09-08 False 0.433 source_a
2015-10-01 True 0.169 source_b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With