I have a dataset :
id url keep_if_dup
1 A.com Yes
2 A.com Yes
3 B.com No
4 B.com No
5 C.com No
I want to remove duplicates, i.e. keep the first occurrence of each value in the "url" field, BUT keep duplicates if the field "keep_if_dup" is "Yes".
Expected output :
id url keep_if_dup
1 A.com Yes
2 A.com Yes
3 B.com No
5 C.com No
What I tried :
Dataframe=Dataframe.drop_duplicates(subset='url', keep='first')
which of course does not take into account "keep_if_dup" field. Output is :
id url keep_if_dup
1 A.com Yes
3 B.com No
5 C.com No
To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T; this removes all columns that hold the same data, regardless of column names.
The pandas.DataFrame.drop_duplicates() method removes duplicate rows from a DataFrame. It can compare rows on a selected subset of columns or on all columns.
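As a rough illustration of the two calls just described, here is a minimal sketch. The sample data and the column names "tag" and "tag_copy" are made up for this example and are not part of the question.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "url": ["A.com", "A.com", "B.com"],
    "tag": ["x", "x", "y"],
    "tag_copy": ["x", "x", "y"],  # same values as "tag"
})

# Drop duplicate rows, comparing only the selected columns.
unique_rows = df.drop_duplicates(subset=["url", "tag"], keep="first")

# Drop duplicate columns: transpose, drop duplicate rows, transpose back.
unique_cols = df.T.drop_duplicates().T

print(unique_rows)
print(unique_cols)
```

Note that the transpose trick compares column contents only, so on a wide or mixed-dtype frame it may be slow and can leave the remaining columns with object dtype.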
You can pass multiple boolean conditions to loc. The first keeps all rows where 'keep_if_dup' == 'Yes'; this is ORed (using |) with the inverted boolean mask of whether the 'url' column is duplicated:
In [79]:
df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]
Out[79]:
id url keep_if_dup
0 1 A.com Yes
1 2 A.com Yes
2 3 B.com No
4 5 C.com No
To overwrite your df, assign the result back:
df = df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]
Breaking the expression down shows the two boolean masks:
In [80]:
~df['url'].duplicated()
Out[80]:
0 True
1 False
2 True
3 False
4 True
Name: url, dtype: bool
In [81]:
df['keep_if_dup'] =='Yes'
Out[81]:
0 True
1 True
2 False
3 False
4 False
Name: keep_if_dup, dtype: bool
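For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample data from the question and applies the same two masks. The helper name drop_dups_unless_flagged and its parameters are hypothetical, chosen only for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_if_dup": ["Yes", "Yes", "No", "No", "No"],
})


def drop_dups_unless_flagged(frame, key="url", flag="keep_if_dup", keep_value="Yes"):
    """Hypothetical helper: keep flagged rows plus the first row per key."""
    flagged = frame[flag] == keep_value      # rows explicitly marked to keep
    first_seen = ~frame[key].duplicated()    # first occurrence of each key
    return frame.loc[flagged | first_seen]


result = drop_dups_unless_flagged(df)
print(result)
```

duplicated() marks every occurrence after the first by default (keep='first'), which is why inverting it keeps the first row for each url. The comparison with 'Yes' is case-sensitive, so values such as 'YES' or 'yes' would need normalising first.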