Consider the following dataframe
import pandas as pd
df = pd.DataFrame({'A' : [1, 2, 3, 3, 4, 4, 5, 6, 7],
'B' : ['a','b','c','c','d','d','e','f','g'],
'Col_1' :[np.NaN, 'A','A', np.NaN, 'B', np.NaN, 'B', np.NaN, np.NaN],
'Col_2' :[2,2,3,3,3,3,4,4,5]})
df
Out[92]:
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
3 3 c NaN 3
4 4 d B 3
5 4 d NaN 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
I want to remove all rows which are duplicates with regards to column 'A' 'B'
. I want to remove the entry which has a NaN
entry (I know that for all dulicates there will be a NaN
and a not-NaN
entry). The end results should look like this
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
All efficient, one-liners are most welcome
If the goal is to only drop the NaN
duplicates, a slightly more involved solution is needed.
First, sort on A
, B
, and Col_1
, so NaN
s are moved to the bottom for each group. Then call df.drop_duplicates
with keep=first
:
out = df.sort_values(['A', 'B', 'Col_1']).drop_duplicates(['A', 'B'], keep='first')
print(out)
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
Here's an alternative:
df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
# A B Col_1 Col_2
# 0 1 a NaN 2
# 1 2 b A 2
# 2 3 c A 3
# 4 4 d B 3
# 6 5 e B 4
# 7 6 f NaN 4
# 8 7 g NaN 5
This uses the bitwise "not" operator ~
to negate rows that meet the joint condition of being a duplicate row (the argument keep=False
causes the method to evaluate to True for all non-unique rows) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False)
returns this Series:
# 0 False
# 1 False
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# 7 False
# 8 False
...and the expression df.isnull().any(axis=1)
returns this Series:
# 0 True
# 1 False
# 2 False
# 3 True
# 4 False
# 5 True
# 6 False
# 7 True
# 8 True
... we wrap both in parentheses (required by Pandas syntax whenever using multiple expressions in indexing operations), and then wrap them in parentheses again so that we can negate the entire expression (i.e. ~( ... )
), like so:
~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
# 0 True
# 1 True
# 2 True
# 3 False
# 4 True
# 5 False
# 6 True
# 7 True
# 8 False
You can build more complex conditions with further use of the logical operators &
and |
(the "or" operator). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter based on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ]
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With