I am working with a very large donation database of data with relevant columns for donation ID, conduit ID, amount, for example:
TRANSACTION_ID BACK_REFERENCE_TRAN_ID_NUMBER CONTRIBUTION_AMOUNT
0 VR0P4H2SEZ1 0 100
1 VR0P4H3X770 0 2700
2 VR0P4GY6QV1 0 500
3 VR0P4H3X720 0 1700
4 VR0P4GYHHA0 VR0P4GYHHA0E 200
What I need to do is to identify all of the rows where the TRANSACTION_ID corresponds to any BACK_REFERENCE_TRAN_ID_NUMBER. My current code, albeit a little clumsy, is:
is_from_conduit = df[df.BACK_REFERENCE_TRAN_ID_NUMBER != "0"].BACK_REFERENCE_TRAN_ID_NUMBER.tolist()
df['CONDUIT_FOR_OTHER_DONATION'] = 0
for row in df.index:
if df['TRANSACTION_ID'][row] in is_from_conduit:
df['CONDUIT_FOR_OTHER_DONATION'][row] = 1
else:
df['CONDUIT_FOR_OTHER_DONATION'][row] = 0
However, on very large data sets with a large number of conduit donations, this takes for ever. I know there must be a simpler way, but clearly I can't come up with how to phrase this to find out what that may be.
You can use Series.isin
. It is a vectorized operation that checks if each element of the Series is in a supplied iterable.
df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique())
As @root mentioned if you prefer 0
/1
(as in your example) instead of True
/False
, you can cast to int
:
df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique()).astype(int)
Here's a NumPy based approach using np.in1d
-
vals = np.in1d(df.TRANSACTION_ID,df.BACK_REFERENCE_TRAN_ID_NUMBER).astype(int)
df['CONDUIT_FOR_OTHER_DONATION'] = vals
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With