 

Delete pandas dataframe NaN rows selectively, grouped by id column which contains duplicates

I have a pandas dataframe like the one below. user is a column that can contain duplicates; C1, C2 and C3 are also columns.

I want to delete only those rows whose user value is duplicated and whose values in the C1, C2 and C3 columns are all NaN.

Expected output for this example: delete the 1st row (user 1), since all its values are NaN and user 1 appears more than once, but don't delete row 3 (user 2): although all its values are NaN, it has only one instance (no duplicate). How can I accomplish this across all such rows?

user      C1         C2         C3
1        NaN        NaN        NaN
1        NaN         x          y
2        NaN        NaN        NaN
3         a          b          c
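For reference, the example frame above can be reconstructed like this (a minimal sketch; the variable name df is just what the answer below assumes):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'C1': [np.nan, np.nan, np.nan, 'a'],
    'C2': [np.nan, 'x', np.nan, 'b'],
    'C3': [np.nan, 'y', np.nan, 'c'],
})
```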
asked Mar 01 '23 by Steve Smith

1 Answer

You can do it like this:

import numpy as np
import pandas as pd

# Count how many times each user id occurs
res = dict(df['user'].value_counts())

def check(idx):
    '''
    Return False (drop the row) when every column value at the given index
    is NaN and the user id occurs more than once; otherwise return True
    (keep the row).
    '''
    if df.loc[idx, 'temp'] == 3 and res[df.loc[idx, 'user']] > 1:
        return False
    else:
        return True

# Temporary column: number of NaN values per row (3 means C1, C2 and C3 are all NaN)
df['temp'] = np.sum(df.isna(), axis=1)
df['temp and dup'] = df.index.map(check)
# Now we just select the rows we want.
df = df[df['temp and dup'] == True]
df.drop(columns=['temp', 'temp and dup'], inplace=True)
df
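An alternative, not part of the answer above but a sketch of the same idea in vectorized form, combines isna().all(axis=1) with duplicated(keep=False) so no helper function or temporary columns are needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'C1': [np.nan, np.nan, np.nan, 'a'],
    'C2': [np.nan, 'x', np.nan, 'b'],
    'C3': [np.nan, 'y', np.nan, 'c'],
})

# Rows where every one of C1, C2, C3 is NaN
all_nan = df[['C1', 'C2', 'C3']].isna().all(axis=1)
# Rows whose user value appears more than once (keep=False flags all copies)
dup_user = df['user'].duplicated(keep=False)
# Keep everything except rows matching both conditions
out = df[~(all_nan & dup_user)]
```

Here only the first row (user 1, all NaN, duplicated) is dropped; the all-NaN user 2 row survives because it is not duplicated.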

If it solved your problem, please accept the answer (green tick).

answered Mar 29 '23 by Abhishek Prajapat