Pandas

Question

I have the following Pandas dataframe:

Index  Name  ID1  ID2  ID3
    1  A     Y    Y    Y
    2  B     Y    Y        
    3  B     Y              
    4  C               Y

I wish to add a new column 'Multiple' to indicate those rows where there is a value Y in more than one of the columns ID1, ID2, and ID3.

Index  Name  ID1  ID2  ID3  Multiple
    1  A     Y    Y    Y    Y
    2  B     Y    Y         Y
    3  B     Y              N
    4  C               Y    N

I'd normally use np.where or np.select, e.g.:

df['Multiple'] = np.where(<more than one of ID1, ID2 or ID3 has a Y in it>, 'Y', 'N')

but I can't figure out how to write the conditional. There might be a growing number of ID columns, so I couldn't cover every combination as a separate condition (e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y)). I think I perhaps want something which counts the Y values across named columns?

Outside of Pandas I would think about working with a list, appending the value from each column where it is Y, and then checking whether the list has a length greater than 1.

But I can't think how to do it within the limitations of np.where, np.select or df.loc. Any pointers?

Yuca · Accepted Answer

Using NumPy to count the occurrences of 'Y' in each row should do it:

df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]

output:

      Name   ID1   ID2   ID3 multi
Index                             
1        A     Y     Y     Y     Y
2        B     Y     Y  None     Y
3        B     Y  None  None     N
4        C  None  None     Y     N
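
Note that df.values == 'Y' compares every column, including Name, so a name that happened to be 'Y' would be counted too. A minimal sketch of a variant that counts only the ID columns and uses np.where, as the question asked, might look like the following. The sample data, the assumption that empty cells are None, and the choice to select columns by the "ID" prefix are all illustrative rather than taken from the original post:

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; empty cells are assumed to be None.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Pick up every ID column by prefix so that later additions
# (ID4, ID5, ...) are included without changing the code.
id_cols = [col for col in df.columns if col.startswith("ID")]

# Count the 'Y' cells per row across the ID columns only.
y_count = df[id_cols].eq("Y").sum(axis=1)

# Flag rows where more than one ID column holds 'Y'.
df["Multiple"] = np.where(y_count > 1, "Y", "N")
print(df)
```

If the ID columns do not share a prefix, an explicit list such as id_cols = ['ID1', 'ID2', 'ID3'] would work the same way.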

Pandas - check if a value exists in multiple columns for each row

Tags: python, conditional-statements

MrDave
