I have the following Pandas dataframe:
Index  Name  ID1  ID2  ID3
1      A     Y    Y    Y
2      B     Y    Y
3      B     Y
4      C     Y
I wish to add a new column 'Multiple' to indicate those rows where there is a value Y in more than one of the columns ID1, ID2, and ID3.
Index  Name  ID1  ID2  ID3  Multiple
1      A     Y    Y    Y    Y
2      B     Y    Y         Y
3      B     Y              N
4      C     Y              N
I'd normally use np.where or np.select, e.g.:
df['Multiple'] = np.where(<more than one of ID1, ID2 or ID3 has a Y>, 'Y', 'N')
but I can't figure out how to write the conditional. There might be a growing number of ID columns, so I couldn't cover every combination as a separate condition (e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y)). I think I perhaps want something that counts the Y values across named columns?
Outside of Pandas I would think about working with a list, appending the value for each column that holds a Y and then checking whether the list has a length greater than 1.
But I can't think how to do it within the limitations of np.where, np.select or df.loc.
Any pointers?
Using NumPy to sum the occurrences of Y in each row should do it:
df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]
output:
      Name ID1   ID2   ID3 multi
Index
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
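Note that df.values == 'Y' compares every column, so a Name that happened to be 'Y' would be counted too. If that matters, here is a rough sketch of a variant that counts only the ID columns and uses np.where as in the question. It assumes the flag columns can be recognised by an "ID" prefix (my assumption, not something stated in the question) and rebuilds the sample frame with None for the blanks:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question; blank cells are stored as None.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", "Y"],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, None],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Assumption: every flag column name starts with "ID",
# so columns such as ID4 added later are picked up automatically.
id_cols = [c for c in df.columns if c.startswith("ID")]

# Count the Y values per row across the ID columns only.
y_count = df[id_cols].eq("Y").sum(axis=1)

# Flag rows with more than one Y.
df["Multiple"] = np.where(y_count > 1, "Y", "N")
print(df)
```

With the sample data this marks rows 1 and 2 as Y and rows 3 and 4 as N. If the ID columns do not share a prefix, passing an explicit list of column names for id_cols works the same way.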