I have a pandas DataFrame like the one below. user is a column which can contain duplicates; C1, C2, C3 are also columns.
I want to delete only those rows whose user value is duplicated and whose values in the C1, C2, C3 columns are all NaN.
Expected output for this example: delete the 1st row (user 1), since user 1 is duplicated and that row is all NaN, but do not delete row 3 (user 2), because even though it is all NaN, user 2 has only one instance (no duplicates). How can I accomplish this across all such rows?
user  C1   C2   C3
1     NaN  NaN  NaN
1     NaN  x    y
2     NaN  NaN  NaN
3     a    b    c
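For reproducing the example, here is a minimal sketch that builds the frame above (assuming the duplicate-key column is literally named user):

import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'C1': [np.nan, np.nan, np.nan, 'a'],
    'C2': [np.nan, 'x', np.nan, 'b'],
    'C3': [np.nan, 'y', np.nan, 'c'],
})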
You can do it like this:
# Columns that must all be NaN for a row to qualify for deletion
cols = ['C1', 'C2', 'C3']

# Get the number of occurrences of each user
res = dict(df['user'].value_counts())

def check(idx):
    '''
    Return False (drop the row) when every value in C1, C2, C3 is NaN
    and the user on that row occurs more than once; otherwise return
    True (keep the row).
    '''
    if df.loc[idx, 'temp'] == len(cols) and res[df.loc[idx, 'user']] > 1:
        return False
    else:
        return True

# Temporary column: how many of C1, C2, C3 are NaN in each row
df['temp'] = df[cols].isna().sum(axis=1)
df['temp and dup'] = df.index.map(check)

# Now we just select the rows we want and drop the helper columns.
df = df[df['temp and dup'] == True]
df.drop(columns=['temp', 'temp and dup'], inplace=True)
df
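A shorter, vectorized way to express the same rule (a sketch, assuming the duplicate-key column is named user and C1, C2, C3 are the value columns) is to combine two boolean masks instead of mapping a Python function over the index:

cols = ['C1', 'C2', 'C3']

# True for every row whose user value appears more than once
is_dup = df.duplicated(subset='user', keep=False)

# True for every row where C1, C2 and C3 are all NaN
all_nan = df[cols].isna().all(axis=1)

# Drop rows that satisfy both conditions, keep everything else
df = df[~(is_dup & all_nan)]

This avoids the helper columns entirely and tends to scale better on large frames.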