Is there a way to conditionally drop duplicates (using drop_duplicates specifically) in a pandas DataFrame with about 10 columns and 400,000 rows? That is, I want to keep all rows where two columns meet a condition: if the combination of date (column) and store number (column) is unique, keep the row; otherwise, drop it.
The drop_duplicates() function returns a Series or DataFrame with duplicate values removed. Its keep parameter controls which occurrences survive: 'first' drops duplicates except the first occurrence, 'last' drops duplicates except the last occurrence, and False drops all duplicates.
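A quick illustration of the three keep modes on a small Series (example data made up for demonstration):

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c', 'a'])

In [3]: s.drop_duplicates()  # keep='first' is the default
Out[3]:
0    a
1    b
3    c
dtype: object

In [4]: s.drop_duplicates(keep='last')
Out[4]:
1    b
3    c
4    a
dtype: object

In [5]: s.drop_duplicates(keep=False)  # drop every value that repeats
Out[5]:
1    b
3    c
dtype: object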
Use drop_duplicates, which returns a DataFrame with duplicate rows removed, optionally considering only a subset of columns.
Let the initial dataframe be:
In [34]: df
Out[34]:
  Col1 Col2  Col3
0    A    B    10
1    A    B    20
2    A    C    20
3    C    B    20
4    A    B    20
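For reference, this example frame can be built as follows (a minimal sketch, not part of the original answer):

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'C', 'A'],
                   'Col2': ['B', 'B', 'C', 'B', 'B'],
                   'Col3': [10, 20, 20, 20, 20]})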
If you want unique combinations of certain columns, say 'Col1' and 'Col2':
In [35]: df.drop_duplicates(['Col1', 'Col2'])
Out[35]:
  Col1 Col2  Col3
0    A    B    10
2    A    C    20
3    C    B    20
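The keep parameter combines with a column subset as well; for example, keep='last' retains row 4 rather than row 0 for the duplicated ('A', 'B') pair:

df.drop_duplicates(['Col1', 'Col2'], keep='last')

which returns rows 2, 3 and 4 instead.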
If you want unique combinations across all columns:
In [36]: df.drop_duplicates()
Out[36]:
  Col1 Col2  Col3
0    A    B    10
1    A    B    20
2    A    C    20
3    C    B    20
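Applied to the original question, assuming the relevant columns are named 'date' and 'store' (names inferred from the question; adjust to the actual column names):

df = df.drop_duplicates(subset=['date', 'store'])

This keeps the first row for each (date, store) combination and drops the rest. If instead every row whose (date, store) pair occurs more than once should be dropped entirely, pass keep=False:

df = df.drop_duplicates(subset=['date', 'store'], keep=False)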