I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows). The dfA looks something like: <pre class="prettyprint"><code> X Y ONSET_TIME COLOUR 0 104 78 1083 6 1 172 78 1083 16 2 240 78 1083 15 3 308 78 1083 8 4 376 78 1083 8 5 444 78 1083 14 6 512 78 1083 14 ... ... ... ... ... </code></pre> The dfB looks something like: <pre class="prettyprint"><code> TIME X Y 0 7 512 350 1 1722 512 214 2 1906 376 214 3 2095 376 146 4 2234 308 78 5 2406 172 146 ... ... ... ... </code></pre> What I want to do is for every row in dfB to find the row in dfA where the values of the X AND Y columns are equal AND that is the first row where the value of dfB['TIME'] is greater than dfA['ONSET_TIME'] and return the value of dfA['COLOUR'] for this row. dfA represents refreshes of a display, where X and Y are coordinates of items on the display and so repeat themselves for every different ONSET_TIME (there are 108 pairs of coodinates for each value of ONSET_TIME). There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too. I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time. <pre class="prettyprint"><code>for s in range(0, len(dfA)): for r in range(0, len(dfB)): if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < dfA.iloc[s+108,2]): return dfA.iloc[s,3] </code></pre>

There is probably an even more efficient way to do this, but here is a method without those slow for loops: <pre class="prettyprint"><code>import pandas as pd dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]}) dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']}) #create one single table mergeDf = pd.merge(dfA, dfB, left_on = ['X','Y'], right_on = ['X','Y']) #remove rows where time is less than onset time filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']] #take min time (closest to onset time) groupedDf = filteredDf.groupby(['X','Y']).max() print filteredDf COLOR ONSET_TIME X Y Time 0 Red 5 1 1 10 1 Blue 7 1 1 10 2 Blue 9 2 2 20 3 red 16 2 2 20 5 Orange 28 3 3 30 print groupedDf COLOR ONSET_TIME Time X Y 1 1 Red 7 10 2 2 red 16 20 3 3 Orange 28 30 </code></pre> The basic idea is to merge the two tables so you have the times together in one table. Then I filtered on the recs that are the largest (closest to the time on your dfB). Let me know if you have questions about this.

Match rows in one Pandas dataframe to another based on three columns

Tags:

python

pandas

dataframe

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).

The dfA looks something like:

      X     Y    ONSET_TIME    COLOUR 
0   104    78          1083         6    
1   172    78          1083        16
2   240    78          1083        15 
3   308    78          1083         8
4   376    78          1083         8
5   444    78          1083        14
6   512    78          1083        14
... ...   ...           ...       ...

The dfB looks something like:

    TIME     X     Y
0      7   512   350 
1   1722   512   214 
2   1906   376   214 
3   2095   376   146 
4   2234   308    78 
5   2406   172   146
...  ...   ...   ...

What I want to do is for every row in dfB to find the row in dfA where the values of the X AND Y columns are equal AND that is the first row where the value of dfB['TIME'] is greater than dfA['ONSET_TIME'] and return the value of dfA['COLOUR'] for this row.

dfA represents refreshes of a display, where X and Y are coordinates of items on the display and so repeat themselves for every different ONSET_TIME (there are 108 pairs of coodinates for each value of ONSET_TIME).

There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.

I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time.

for s in range(0, len(dfA)):
    for r in range(0, len(dfB)):
        if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < dfA.iloc[s+108,2]):
            return dfA.iloc[s,3]

620

asked Jul 14 '14 14:07

Alex M-R

2 Answers

There is probably an even more efficient way to do this, but here is a method without those slow for loops:

import pandas as pd

dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})

#create one single table
mergeDf = pd.merge(dfA, dfB, left_on = ['X','Y'], right_on = ['X','Y'])
#remove rows where time is less than onset time
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
#take min time (closest to onset time)
groupedDf = filteredDf.groupby(['X','Y']).max()

print filteredDf

 COLOR  ONSET_TIME  X  Y  Time
0     Red           5  1  1    10
1    Blue           7  1  1    10
2    Blue           9  2  2    20
3     red          16  2  2    20
5  Orange          28  3  3    30


print groupedDf

COLOR  ONSET_TIME  Time
X Y                          
1 1     Red           7    10
2 2     red          16    20
3 3  Orange          28    30

The basic idea is to merge the two tables so you have the times together in one table. Then I filtered on the recs that are the largest (closest to the time on your dfB). Let me know if you have questions about this.

169

answered Oct 18 '22 05:10

flyingmeatball

Use merge() - it works like JOIN in SQL - and you have first part done.

d1 = '''      X     Y    ONSET_TIME    COLOUR 
   104    78          1083         6    
   172    78          1083        16
   240    78          1083        15 
   308    78          1083         8
   376    78          1083         8
   444    78          1083        14
   512    78          1083        14
   308    78          3000        14
   308    78          2000        14''' 


d2 = '''    TIME     X     Y
      7   512   350 
   1722   512   214 
   1906   376   214 
   2095   376   146 
   2234   308    78 
   2406   172   146'''

import pandas as pd
from StringIO import StringIO

dfA = pd.DataFrame.from_csv(StringIO(d1), sep='\s+', index_col=None)
#print dfA

dfB = pd.DataFrame.from_csv(StringIO(d2), sep='\s+', index_col=None)
#print dfB

df1 =  pd.merge(dfA, dfB, on=['X','Y'])
print df1

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
1  308  78        3000      14  2234
2  308  78        2000      14  2234

Then you can use it to filter results.

df2 = df1[ df1['ONSET_TIME'] < df1['TIME'] ]
print df2

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
2  308  78        2000      14  2234

answered Oct 18 '22 03:10

furas

Related questions
                            
                                What are the advantages of PyQt over PyGTK and vice-versa? [closed]
                            
                                Python 3 exception not printing new line
                            
                                Best way to calculate the fundamental matrix of an absorbing Markov Chain?
                            
                                celery worker does not retry task after calling retry()
                            
                                Scapy how get ping time?
                            
                                naming a file when downloading with Selenium Webdriver
                            
                                image/video processing options
                            
                                itertools.islice implementation -- efficiently slicing a list
                            
                                Python multiprocessing Queues reliability, Queue vs SimpleQueue vs JoinableQueue
                            
                                SciPy/NumPy import guideline
                            
                                Python 2.7 argparse: How to nest optional mutally exclusive arguments properly?
                            
                                Python: uWSGI configuration for NGINX+FLASK
                            
                                sqlalchemy, postgresql and relationship stuck in "idle in transaction"
                            
                                monte carlo simulation of protein structure and grid
                            
                                Finding which packages support Python 3.x vs 2.7.x
                            
                                Python cannot import module from subdirectory even with a file named __init.py__ in the directory
                            
                                Python - delete blank lines of text at the end of the file
                            
                                How to find the breakpoint numbers in pdb (ipdb)?
                            
                                Use Python to send keystrokes to games in Windows?
                            
                                SQLAlchemy session error: InvalidRequestError

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With