Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match rows in one Pandas dataframe to another based on three columns

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).

The dfA looks something like:

      X     Y    ONSET_TIME    COLOUR 
0   104    78          1083         6    
1   172    78          1083        16
2   240    78          1083        15 
3   308    78          1083         8
4   376    78          1083         8
5   444    78          1083        14
6   512    78          1083        14
... ...   ...           ...       ...

The dfB looks something like:

    TIME     X     Y
0      7   512   350 
1   1722   512   214 
2   1906   376   214 
3   2095   376   146 
4   2234   308    78 
5   2406   172   146
...  ...   ...   ...  

What I want to do is for every row in dfB to find the row in dfA where the values of the X AND Y columns are equal AND that is the first row where the value of dfB['TIME'] is greater than dfA['ONSET_TIME'] and return the value of dfA['COLOUR'] for this row.

dfA represents refreshes of a display, where X and Y are coordinates of items on the display and so repeat themselves for every different ONSET_TIME (there are 108 pairs of coodinates for each value of ONSET_TIME).

There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.

I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time.

for s in range(0, len(dfA)):
    for r in range(0, len(dfB)):
        if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < dfA.iloc[s+108,2]):
            return dfA.iloc[s,3]
like image 620
Alex M-R Avatar asked Jul 14 '14 14:07

Alex M-R


People also ask

How can pandas select rows based on multiple conditions?

You can get pandas. Series of bool which is an AND of two conditions using & . Note that == and ~ are used here as the second condition for the sake of explanation, but you can use !=


2 Answers

There is probably an even more efficient way to do this, but here is a method without those slow for loops:

import pandas as pd

dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})

#create one single table
mergeDf = pd.merge(dfA, dfB, left_on = ['X','Y'], right_on = ['X','Y'])
#remove rows where time is less than onset time
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
#take min time (closest to onset time)
groupedDf = filteredDf.groupby(['X','Y']).max()

print filteredDf

 COLOR  ONSET_TIME  X  Y  Time
0     Red           5  1  1    10
1    Blue           7  1  1    10
2    Blue           9  2  2    20
3     red          16  2  2    20
5  Orange          28  3  3    30


print groupedDf

COLOR  ONSET_TIME  Time
X Y                          
1 1     Red           7    10
2 2     red          16    20
3 3  Orange          28    30

The basic idea is to merge the two tables so you have the times together in one table. Then I filtered on the recs that are the largest (closest to the time on your dfB). Let me know if you have questions about this.

like image 169
flyingmeatball Avatar answered Oct 18 '22 05:10

flyingmeatball


Use merge() - it works like JOIN in SQL - and you have first part done.

d1 = '''      X     Y    ONSET_TIME    COLOUR 
   104    78          1083         6    
   172    78          1083        16
   240    78          1083        15 
   308    78          1083         8
   376    78          1083         8
   444    78          1083        14
   512    78          1083        14
   308    78          3000        14
   308    78          2000        14''' 


d2 = '''    TIME     X     Y
      7   512   350 
   1722   512   214 
   1906   376   214 
   2095   376   146 
   2234   308    78 
   2406   172   146'''

import pandas as pd
from StringIO import StringIO

dfA = pd.DataFrame.from_csv(StringIO(d1), sep='\s+', index_col=None)
#print dfA

dfB = pd.DataFrame.from_csv(StringIO(d2), sep='\s+', index_col=None)
#print dfB

df1 =  pd.merge(dfA, dfB, on=['X','Y'])
print df1

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
1  308  78        3000      14  2234
2  308  78        2000      14  2234

Then you can use it to filter results.

df2 = df1[ df1['ONSET_TIME'] < df1['TIME'] ]
print df2

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
2  308  78        2000      14  2234
like image 40
furas Avatar answered Oct 18 '22 03:10

furas