 

Fast pandas filtering

Tags: python, pandas

I want to filter a pandas DataFrame, keeping only the rows whose name column value appears in a given list.

Here we have a DataFrame:

import pandas as pd

x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])

Now let's say we have a list, ['sam', 'ruby'], and we want to find all rows where the name is in the list, then sum the score.

The solution I have is as follows:

total = 0
names = ['sam', 'ruby']
for name in names:
    # scan the whole frame once per name
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])

However, when the DataFrame gets extremely large and the list of names grows too, everything becomes very slow: each pass through the loop scans the entire DataFrame, so the cost grows with len(names) × len(x).

Is there any faster alternative?

Thanks

asked Feb 12 '14 by redrubia


2 Answers

Try using isin (thanks to DSM for suggesting loc over ix here):

In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])

In [79]: names = ['sam', 'ruby']

In [80]: x['name'].isin(names)
Out[80]: 
0     True
1     True
2    False
Name: name, dtype: bool

In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541

CT Zhu suggests a faster alternative using np.in1d:

In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop

In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
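
For completeness, here is a minimal standalone script (assuming only pandas and numpy are installed) that reproduces the comparison with the standard-library timeit module instead of IPython's %timeit magic:

import numpy as np
import pandas as pd
from timeit import timeit

x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                 columns=['name', 'score'])
names = ['sam', 'ruby']
y = pd.concat([x] * 1000)  # enlarge the frame, as in the timings above

# Both expressions should produce the same total.
assert (y.loc[y['name'].isin(names), 'score'].sum()
        == y.loc[np.in1d(y['name'], names), 'score'].sum())

print(timeit(lambda: y.loc[y['name'].isin(names), 'score'].sum(), number=1000))
print(timeit(lambda: y.loc[np.in1d(y['name'], names), 'score'].sum(), number=1000))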
answered Oct 13 '22 by unutbu


If I need to search on a field, I have noticed that it helps immensely to change the DataFrame's index to the search field. For one of my search-and-lookup requirements I saw a performance improvement of around 500%.

So in your case the following could be used to search and filter by name.

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']

df_searchable = df.set_index('name')

df_searchable[df_searchable.index.isin(names)]
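
Once the index is set, repeated lookups can go through .loc directly; here is a minimal sketch of that pattern (the sort_index call is my own addition, to keep label lookups fast on a sorted index):

df_searchable = df.set_index('name').sort_index()  # one-time indexing cost

print(df_searchable.loc['sam'])                  # single-name lookup
print(df_searchable.loc[names, 'score'].sum())   # 3541: sum over the filtered names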

Update Dec-21

Updates are driven by the comments on this answer.

Looking at the details of my use case, it's not magic that is happening here. My use case was running millions of look-ups on a column with around 45k values; from what I remember, it was a lookup on US zip codes. Understandably, once set_index has incurred its one-time optimization cost, subsequent look-ups become much faster. The overall effect is magnified by the large number of look-ups, with the cost of the optimization amortized across all of them.

The impressive performance-improvement number is essentially due to this highly amortized optimization cost.
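
A rough sketch of that amortization effect (the 45,000-row setup and synthetic zip codes below are stand-ins of my own for the use case described above):

import numpy as np
import pandas as pd
from timeit import timeit

n = 45_000  # roughly the column cardinality mentioned above
df = pd.DataFrame({'zip': np.arange(n).astype(str), 'val': np.arange(n)})

indexed = df.set_index('zip').sort_index()  # one-time optimization cost paid here

# A boolean mask rescans the whole column on every lookup,
# while .loc on a sorted unique index resolves each label directly.
print(timeit(lambda: df[df['zip'] == '12345'], number=1000))
print(timeit(lambda: indexed.loc['12345'], number=1000))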

answered Oct 13 '22 by Dhwani Katagade