I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.
As an example, this is a simplified version of my problem:
User Time Col1 newcol1 newcol2 newcol3 newcol4
0 1 6 [cat, dog, goat] 0 0 0 0
1 1 6 [cat, sheep] 0 0 0 0
2 1 12 [sheep, goat] 0 0 0 0
3 2 3 [cat, lion] 0 0 0 0
4 2 5 [fish, goat, lemur] 0 0 0 0
5 3 9 [cat, dog] 0 0 0 0
6 4 4 [dog, goat] 0 0 0 0
7 4 11 [cat] 0 0 0 0
At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also on whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.
Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this, but it's pretty similar to what I'm actually doing in my IPython notebook):
def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of the dataframe (not including the first/zeroth)
    for i in range(1, len(df)):
        # has the user changed?
        if df.User.loc[i] == df.User.loc[i - 1]:
            # has the time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i - 1]) > 1:
                # update the new columns (df.loc[row, col] avoids chained assignment)
                df.loc[i - 1, 'newcol2'] = 1
                df.loc[i, 'newcol1'] = 1
                # increase jnum
                jnum += 1
            # has the content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i - 1][0]:
                # record this change (.at sets a single cell; newcol4 must be object dtype to hold a list)
                df.at[i - 1, 'newcol4'] = [df.Col1.loc[i - 1][0], df.Col2.loc[i][0]]
        # different user?
        else:
            # update the new columns
            df.loc[i, 'newcol1'] = 1
            df.loc[i - 1, 'newcol2'] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1
I now need to apply this function to several million rows, and it's impossibly slow, so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions, but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up, or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!
(As regards using .loc, I compared .loc, .iloc and .ix, and .loc was marginally faster, so that's the only reason I'm using it currently.)
(Also, my User column in reality is unicode, not int, which could be problematic for speedy comparisons.)
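One hedged aside on that last point, not from the original post: the unicode comparisons can be sidestepped by factorizing the column into integer codes first; the user_code name below is made up.

import pandas as pd

# map each unicode user to a small integer; comparing ints is cheap
codes, uniques = pd.factorize(df['User'])
df['user_code'] = codes  # hypothetical helper column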
I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)
# per-user time difference; NaN marks the first row of each user
df['time_diff'] = df.groupby('User')['Time'].diff()
# extract the first element of each list in Col1 (the expensive step, done once)
df['Col1_0'] = df['Col1'].apply(lambda x: x[0])
# the previous row's first element, again computed per user
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
User Time Col1 time_diff Col1_0 Col1_0_prev
0 1 6 [cat, dog, goat] NaN cat NaN
1 1 6 [cat, sheep] 0 cat cat
2 1 12 [sheep, goat] 6 sheep cat
3 2 3 [cat, lion] NaN cat NaN
4 2 5 [fish, goat, lemur] 2 fish cat
5 3 9 [cat, dog] NaN cat NaN
6 4 4 [dog, goat] NaN dog NaN
7 4 11 [cat] 7 cat dog
As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version as well). Doing it this way, you only have to do the expensive extraction once; after that you can stick to standard pandas methods.
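Not in the original answer, but to illustrate where this leads: with Col1_0 and Col1_0_prev in place, the loop's content-change check becomes an ordinary vectorized comparison. A minimal sketch, assuming the same semantics as the loop (the question actually stores a pair of values in newcol4 rather than a flag):

# a content change within a user: a previous value exists and differs from the current one
changed = df['Col1_0_prev'].notna() & (df['Col1_0'] != df['Col1_0_prev'])
df['newcol3'] = changed.astype(int)  # hypothetical 0/1 flag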
Use pandas constructs and vectorize your code, i.e. don't use for loops; use pandas/numpy functions instead.
'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.
Calculate these separately:
df['newcol1'] = df['User'].shift() == df['User']
df.loc[0, 'newcol1'] = True  # possibly tweak the first row??
df['newcol2'] = (df['Time'].shift() - df['Time']).abs() > 1
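Not part of the original answer, but one way to connect the two halves: in the question's loop, newcol2 is just newcol1's flag written to the previous row (the .loc[i-1] writes), so once the two conditions are combined into a single "break" mask, both columns fall out of one shift. A sketch, assuming that reading of the loop:

# combine the two conditions into one "break" mask
user_changed = df['User'] != df['User'].shift()
big_gap = (df['Time'] - df['Time'].shift()).abs() > 1
break_mask = user_changed | big_gap
break_mask.iloc[0] = False  # the loop starts at row 1, so row 0 is never flagged

df['newcol1'] = break_mask.astype(int)
# newcol2 flags the row *before* each break
df['newcol2'] = break_mask.shift(-1, fill_value=False).astype(int)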
The purpose of Col1 is unclear to me, but general Python objects in columns don't scale well (you can't use the fast paths, and the contents are scattered in memory). Most of the time you can get away with using something else...
Cython is the very last option, and not needed in 99% of use cases, but see the enhancing performance section of the docs for tips.
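For completeness (this goes beyond the original answer): that docs section covers numba as well as Cython, and a numba-jitted version of the user/time-gap loop might look roughly like the sketch below. It assumes numba is installed, and it factorizes the unicode users into integer codes first, since numba can't compare Python strings cheaply:

import numpy as np
import pandas as pd
from numba import njit  # assumption: numba is available

@njit
def gap_flags(codes, times):
    n = len(times)
    newcol1 = np.zeros(n, dtype=np.int64)
    newcol2 = np.zeros(n, dtype=np.int64)
    for i in range(1, n):
        # user changed, or same user with a time gap of more than 1
        if codes[i] != codes[i - 1] or abs(times[i] - times[i - 1]) > 1:
            newcol1[i] = 1
            newcol2[i - 1] = 1
    return newcol1, newcol2

codes, _ = pd.factorize(df['User'])
df['newcol1'], df['newcol2'] = gap_flags(codes, df['Time'].to_numpy())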