Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speeding up cross-reference filtering in Pandas DB

I am working with a very large donation database of data with relevant columns for donation ID, conduit ID, amount, for example:

  TRANSACTION_ID BACK_REFERENCE_TRAN_ID_NUMBER  CONTRIBUTION_AMOUNT
0    VR0P4H2SEZ1                             0                  100
1    VR0P4H3X770                             0                 2700
2    VR0P4GY6QV1                             0                  500
3    VR0P4H3X720                             0                 1700
4    VR0P4GYHHA0                  VR0P4GYHHA0E                  200

What I need to do is to identify all of the rows where the TRANSACTION_ID corresponds to any BACK_REFERENCE_TRAN_ID_NUMBER. My current code, albeit a little clumsy, is:

is_from_conduit = df[df.BACK_REFERENCE_TRAN_ID_NUMBER != "0"].BACK_REFERENCE_TRAN_ID_NUMBER.tolist()
df['CONDUIT_FOR_OTHER_DONATION'] = 0
for row in df.index:
    if df['TRANSACTION_ID'][row] in is_from_conduit:
        df['CONDUIT_FOR_OTHER_DONATION'][row] = 1
    else:
        df['CONDUIT_FOR_OTHER_DONATION'][row] = 0

However, on very large data sets with a large number of conduit donations, this takes for ever. I know there must be a simpler way, but clearly I can't come up with how to phrase this to find out what that may be.

like image 927
whateveryousayiam Avatar asked Mar 11 '23 15:03

whateveryousayiam


2 Answers

You can use Series.isin. It is a vectorized operation that checks if each element of the Series is in a supplied iterable.

df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique())

As @root mentioned if you prefer 0/1 (as in your example) instead of True/False, you can cast to int:

df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique()).astype(int)
like image 101
Alex Avatar answered Mar 14 '23 22:03

Alex


Here's a NumPy based approach using np.in1d -

vals = np.in1d(df.TRANSACTION_ID,df.BACK_REFERENCE_TRAN_ID_NUMBER).astype(int)
df['CONDUIT_FOR_OTHER_DONATION'] = vals
like image 36
Divakar Avatar answered Mar 14 '23 23:03

Divakar