
Comparing lists in two columns row-wise efficiently

I have a Pandas DataFrame like this:

import pandas as pd
import numpy as np
df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']], 
                   'yesterday': [['a', 'b'], ['a'], ['a']]})
                 today        yesterday
0      ['a', 'b', 'c']       ['a', 'b']
1           ['a', 'b']            ['a']
2                ['b']            ['a']                          
... etc

The real DataFrame has about 100 000 rows. For each row, I want to find the additions and removals between the lists in the two columns.

This is similar to the question "Pandas: How to Compare Columns of Lists Row-wise in a DataFrame with Pandas (not for loop)?", but I am interested in the differences, and DataFrame.apply does not seem fast enough for this many rows. This is the code I am currently using, which combines DataFrame.apply with NumPy's setdiff1d:

additions = df.apply(lambda row: np.setdiff1d(row.today, row.yesterday), axis=1)
removals  = df.apply(lambda row: np.setdiff1d(row.yesterday, row.today), axis=1)

This works, but it takes about a minute for 120 000 entries. Is there a faster way to accomplish this?

MegaCookie asked Jan 08 '20 19:01




1 Answer

I am not sure about the performance, but for lack of a better solution, this might work:

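# Convert each list to a set so that "-" means set difference
temp = df[['today', 'yesterday']].applymap(set)
# diff along the columns: yesterday - today gives the removals
removals = temp.diff(periods=1, axis=1).dropna(axis=1)
# diff in the other direction: today - yesterday gives the additions
additions = temp.diff(periods=-1, axis=1).dropna(axis=1)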

Removals:

  yesterday
0        {}
1        {}
2       {a}

Additions:

  today
0   {c}
1   {b}
2   {b}
r.ook answered Sep 28 '22 02:09
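
As a possible alternative that is not part of the answer above, the row-wise differences can also be sketched with plain Python sets in a list comprehension, which avoids calling a function through apply(axis=1) for every row. The helper name row_diffs is hypothetical, and no timing has been measured here, so whether this is actually faster on 120 000 rows is an assumption to verify on the real data. The results are sorted to mimic the sorted, de-duplicated output of np.setdiff1d.

```python
import pandas as pd

df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']],
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

def row_diffs(left, right):
    """Hypothetical helper: for each row, the sorted items of left not in right."""
    return [sorted(set(l) - set(r)) for l, r in zip(left, right)]

# Items present today but not yesterday
additions = pd.Series(row_diffs(df['today'], df['yesterday']), index=df.index)
# Items present yesterday but not today
removals = pd.Series(row_diffs(df['yesterday'], df['today']), index=df.index)

print(additions.tolist())
print(removals.tolist())
```

Each result is a Series of lists aligned with the index of df, so it can be assigned back as a new column if needed.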