
Comparing previous row values in Pandas DataFrame

    import pandas as pd

    data = {'col1': [1, 3, 3, 1, 2, 3, 2, 2]}
    df = pd.DataFrame(data, columns=['col1'])
    print(df)

       col1
    0     1
    1     3
    2     3
    3     1
    4     2
    5     3
    6     2
    7     2

I have the Pandas DataFrame above and I want to create another column that compares each value of col1 with the value in the previous row to see if they are equal. What would be the best way to do this? The result should look like the following DataFrame. Thanks

       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True
jth359 asked Dec 30 '16 16:12


People also ask

How do I compare row values in Pandas?

You can use the DataFrame.diff() function to find the difference between rows in a pandas DataFrame. Its periods parameter sets the number of previous rows to look back when calculating the difference.

How do I compare two rows in a DataFrame Pandas?

During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using the pandas.DataFrame.diff() function.

What does diff do in Pandas?

The diff() function calculates the difference between a DataFrame element and another element in the same DataFrame. Its periods parameter sets how many rows to shift when calculating the difference, and it accepts negative values.
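As a minimal sketch of how diff() behaves, the example below applies it to the col1 values from the question. The particular periods values are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 3, 1, 2, 3, 2, 2], name='col1')

# periods=1 (the default): each value minus the value one row above it.
# The first row has no predecessor, so its difference is NaN.
d1 = s.diff()

# periods=2: each value minus the value two rows above it.
d2 = s.diff(periods=2)

# A negative period compares each value with a following row instead.
d_next = s.diff(periods=-1)

print(pd.DataFrame({'col1': s, 'diff_1': d1, 'diff_2': d2, 'diff_-1': d_next}))
```

A difference of 0 means the two compared rows hold the same value, which is what makes diff() usable for the equality check asked about here.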

How can you tell if two rows are the same in Pandas?

The equals() function is used to test whether two Pandas objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
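A short sketch of the difference between equals() and an element-wise comparison; the small Series and DataFrame here are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# equals() returns a single bool for the whole object and treats
# NaNs in the same position as equal.
whole = a.equals(b)

# eq() compares element by element; NaN is never equal to NaN.
elementwise = a.eq(b)

# Comparing two rows of a DataFrame: select each row as a Series first.
df = pd.DataFrame({'x': [1, 1, 2], 'y': [5, 5, 6]})
rows_0_1_same = df.iloc[0].equals(df.iloc[1])
rows_1_2_same = df.iloc[1].equals(df.iloc[2])

print(whole, elementwise.tolist(), rows_0_1_same, rows_1_2_same)
```

Note that equals() answers "are these two objects identical?" with one value, so it does not by itself produce a per-row match column like the one requested in the question.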


2 Answers

You need eq with shift:

    df['match'] = df.col1.eq(df.col1.shift())
    print (df)

       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True

Alternatively, use == instead of eq, but it is a bit slower on a large DataFrame:

    df['match'] = df.col1 == df.col1.shift()
    print (df)

       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True
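To see why the first row always comes out False, it can help to look at the shifted column itself. The sketch below stores it in a helper column named prev (a name introduced here for illustration only) and also tries the same pattern on a column of strings.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})

# shift() moves every value down one row; the first row becomes NaN,
# which also turns the integer column into floats.
df['prev'] = df['col1'].shift()

# NaN never compares equal to anything, so the first row is always False.
df['match'] = df['col1'].eq(df['prev'])
print(df)

# The same pattern works on non-numeric data, where subtraction is not defined.
labels = pd.Series(['a', 'a', 'b', 'b', 'a'])
label_match = labels.eq(labels.shift())
print(label_match.tolist())
```

Because shift plus eq only relies on equality, it is the more general choice when the column may hold strings or other non-numeric values.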

Timings:

    import pandas as pd

    data = {'col1': [1, 3, 3, 1, 2, 3, 2, 2]}
    df = pd.DataFrame(data, columns=['col1'])
    print (df)

    # [80000 rows x 1 columns]
    df = pd.concat([df]*10000).reset_index(drop=True)

    df['match'] = df.col1 == df.col1.shift()
    df['match1'] = df.col1.eq(df.col1.shift())
    print (df)

    In [208]: %timeit df.col1.eq(df.col1.shift())
    The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 933 µs per loop

    In [209]: %timeit df.col1 == df.col1.shift()
    1000 loops, best of 3: 1 ms per loop

jezrael answered Oct 08 '22 22:10


1) pandas approach: Use diff:

df['match'] = df['col1'].diff().eq(0) 

2) numpy approach: Use np.ediff1d.

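    import numpy as np

    # Cast to float so that NaN is a valid first element; recent NumPy versions
    # reject a NaN to_begin for integer input, and np.NaN was removed in NumPy 2.0.
    df['match'] = np.ediff1d(df['col1'].to_numpy(dtype=float), to_begin=np.nan) == 0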

Both produce the same match column as the expected output in the question.

Timings: (for the same DF used by @jezrael)

    %timeit df.col1.eq(df.col1.shift())
    1000 loops, best of 3: 731 µs per loop

    %timeit df['col1'].diff().eq(0)
    1000 loops, best of 3: 405 µs per loop

Nickil Maveli answered Oct 08 '22 23:10
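The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module, as below. The candidates dictionary and the repeat counts are assumptions made for this sketch, and the numbers will differ by machine and by pandas/NumPy version, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})
df = pd.concat([df] * 10000).reset_index(drop=True)  # 80000 rows

# Each candidate returns a boolean "same as previous row" result.
candidates = {
    'eq + shift': lambda: df['col1'].eq(df['col1'].shift()),
    '== + shift': lambda: df['col1'] == df['col1'].shift(),
    'diff + eq(0)': lambda: df['col1'].diff().eq(0),
    # Values are cast to float so that NaN is an acceptable first element.
    'np.ediff1d': lambda: np.ediff1d(df['col1'].to_numpy(dtype=float),
                                     to_begin=np.nan) == 0,
}

# Check that all approaches agree before comparing their speed.
results = {name: np.asarray(fn()) for name, fn in candidates.items()}
reference = results['eq + shift']
all_agree = all(np.array_equal(reference, r) for r in results.values())
print('all approaches agree:', all_agree)

# Best-of-3 average time per call, in seconds.
timings = {}
for name, fn in candidates.items():
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    timings[name] = best
    print(f'{name:>12}: {best * 1e6:.0f} µs per call')
```

The diff-based variants only apply to numeric columns, so a faster timing there does not help if the column holds strings.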