Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: speed up df.loc based on repeat index values

I have the pandas DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

from which I would like to select rows based on values of the index (x) in the selection array

selection = ['a', 'c', 'b', 'b', 'c', 'a']

The correct output can be obtained by using df.loc as follows

out = df.loc[selection]

The problem I am running in to is df.loc is running pretty slow on large DataFrames (2-7 million rows). Is there a way to speed up this operation? I've looked into eval(), but it doesn't seem to apply to hard-coded lists of index values like this. I have also thought about using pd.DataFrame.isin, but that misses the repeat values (only returns a row per unique element in selection).

like image 898
philE Avatar asked Oct 26 '15 20:10

philE


2 Answers

You can get a decent speedup by using reindex instead of loc:

df.reindex(selection)

Timings (version 0.17.0):

>>> selection2 = selection * 100 # a larger list of labels
>>> %timeit df.loc[selection2]
100 loops, best of 3: 2.54 ms per loop

>>> %timeit df.reindex(selection2)
1000 loops, best of 3: 833 µs per loop

The two methods take different paths (hence the speed difference).

loc builds the new DataFrame by calling down to get_indexer_non_unique which is necessarily more complex than the simple get_indexer (used for unique values).

On the other hand, the hard work in reindex appears to be done by the take_* functions in generated.pyx. These functions appear to be faster for the purpose of constructing the new DataFrame.

like image 153
Alex Riley Avatar answered Oct 04 '22 23:10

Alex Riley


You can try merge:

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
})

df1 = pd.DataFrame({'x':selection})

In [21]: pd.merge(df1,df,on='x', how='left')
Out[21]: 
   x  y  z
0  a  1  f
1  c  2  s
2  b  2  s
3  b  2  s
4  c  2  s
5  a  1  f
like image 28
Colonel Beauvel Avatar answered Oct 04 '22 22:10

Colonel Beauvel