I have the following pandas DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')
from which I would like to select rows based on the values of the index (x) in the selection list:

selection = ['a', 'c', 'b', 'b', 'c', 'a']
The correct output can be obtained by using df.loc as follows:

out = df.loc[selection]
The problem I am running into is that df.loc runs pretty slowly on large DataFrames (2-7 million rows). Is there a way to speed up this operation? I've looked into eval(), but it doesn't seem to apply to hard-coded lists of index values like this. I have also thought about using pd.DataFrame.isin, but that misses the repeated values (it returns only one row per unique element of selection).
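To illustrate the isin problem mentioned above, here is a minimal sketch: a boolean mask can only keep or drop each existing row once, so the repeats in selection are lost.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# Boolean masking keeps at most one copy of each row,
# so repeated labels in `selection` produce no repeated rows.
masked = df[df.index.isin(selection)]
print(len(masked))             # 3 rows...
print(len(df.loc[selection]))  # ...versus the 6 rows that df.loc returns
```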
You can get a decent speedup by using reindex instead of loc:

df.reindex(selection)
Timings (version 0.17.0):
>>> selection2 = selection * 100 # a larger list of labels
>>> %timeit df.loc[selection2]
100 loops, best of 3: 2.54 ms per loop
>>> %timeit df.reindex(selection2)
1000 loops, best of 3: 833 µs per loop
The two methods take different paths (hence the speed difference): loc builds the new DataFrame by calling down to get_indexer_non_unique, which is necessarily more complex than the simple get_indexer (used for unique indexes). On the other hand, the hard work in reindex appears to be done by the take_* functions in generated.pyx. These functions appear to be faster for the purpose of constructing the new DataFrame.
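One behavioral difference worth noting (my addition, not part of the original answer): reindex silently fills in NaN rows for labels that are missing from the index, whereas loc with a list containing missing labels raises a KeyError in recent pandas versions. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

# 'q' is not in the index: reindex returns a NaN-filled row for it,
# so the 'y' column is upcast to float to hold the NaN.
out = df.reindex(['a', 'q'])
print(out['y'].isna().tolist())  # [False, True]

# By contrast, df.loc[['a', 'q']] raises KeyError in modern pandas.
```

So the two are only interchangeable when every label in the selection is guaranteed to exist in the index.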
You can try merge:

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
})
df1 = pd.DataFrame({'x': selection})
In [21]: pd.merge(df1,df,on='x', how='left')
Out[21]:
x y z
0 a 1 f
1 c 2 s
2 b 2 s
3 b 2 s
4 c 2 s
5 a 1 f
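A small follow-up sketch (my addition): the merge result carries a fresh RangeIndex rather than x. If you want output matching df.loc[selection], you can chain set_index to restore x as the index; a left merge preserves the order of the left frame's keys.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
})
selection = ['a', 'c', 'b', 'b', 'c', 'a']

# Left-merge against the selection, then restore 'x' as the index.
out = (pd.merge(pd.DataFrame({'x': selection}), df, on='x', how='left')
         .set_index('x'))
print(out.index.tolist())  # ['a', 'c', 'b', 'b', 'c', 'a']
```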