Suppose I have a pandas Series that I'd like to function as a multimap (multiple values for each index key):
import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
I'd like to select (as quickly as possible) all the values from a where a's index matches another index b. (Like an inner join, or a merge, but for Series.) a may have duplicates in its index; b may not have duplicates, and b is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted index object:

b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
So, we would have something like:

target:
a        b      result
3  0     3      3  0
3  1     7      3  1
4  2     8      8  3
8  3     ...    ...
9  4
...
I'm also only interested in getting the values of the result (the index [3, 3, 8, ...] is not needed).
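To make the target concrete, here is a minimal sketch of those inner-join semantics via merge, using a and b as defined above (illustrative only, and not the fast path I'm after; the 'key' and 'value' column names are just labels introduced for the example):

result = pd.merge(a.rename('value').rename_axis('key').reset_index(),
                  pd.DataFrame({'key': b}),
                  on='key', how='inner')['value']
# result holds every value of a whose index key appears in b,
# with duplicates from a's index preserved.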
If a did not have duplicates, we would simply do:

a.reindex(b)  # ValueError: cannot reindex from a duplicate axis
Because & maintains the duplicates from a, we can't do:

d = a[a.index & b]      # b is already an Index, so no .index attribute
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print(d.shape)          # larger than expected: duplicate keys also match duplicate rows
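To see why, here is a tiny sketch with toy data: when both the lookup keys and the index contain duplicates, .loc returns every matching row once per occurrence of the key, so the shape inflates:

s = pd.Series([10, 11, 12], index=[3, 3, 4])
print(s.loc[[3, 3]].shape)  # (4,): each duplicate key matches both rows for key 3
print(s.loc[[3]].shape)     # (2,): a unique key still returns both rows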
So I think we need to do something like:
common = (a.index & b).unique()
a.loc[common]
... which is cumbersome, but also surprisingly slow. It's not building the list of items to select that's slow:
%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop
... so it looks like it's really retrieving the values that's slow:
common = ((a.index & b).unique()).sort_values()
%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop
%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop
... That's around 20 operations per second. Not exactly zippy! Why so slow?
Surely there must be a fast way to look up a set of values from a pandas Series? I don't want to get an indexed object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-per-second operation on my 3 GHz machine.
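For what it's worth, one commonly suggested alternative, assuming the order and index of the result don't matter, is a boolean mask via Index.isin, which sidesteps the non-unique .loc machinery entirely:

mask = a.index.isin(b)   # vectorised membership test for each element of a's index
result = a.values[mask]  # plain ndarray of the matching values, no index attached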
Also, profiling a.loc[common] gives:
ncalls  tottime   percall   cumtime   percall    filename:lineno(function)
# All the time is spent here:
40      1.01      0.02525   1.018     0.02546    ~:0(<method 'get_indexer_non_unique'>)
...                                              indexing.py:1443(_has_valid_type)
# seems to be called a lot:
1500    0.000582  3.88e-07  0.000832  5.547e-07  ~:0(<isinstance>)
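(For reference, a profile like the one above can be reproduced with IPython's %prun magic, e.g. %prun a.loc[common].)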
PS: I previously posted a similar question about why Series.map is so slow: Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing; that doesn't seem to be what's happening here.
Update:
For a similarly sized a and common, where a's index is unique:

%timeit a.loc[common]
# 1000 loops, best of 3: 760 µs per loop

... as @jpp points out, the non-unique index is likely to blame.
Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common] # 34.1 ms
a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common] # 6.86 ms
As mentioned in this related question:

"When the index is unique, pandas uses a hashtable to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is randomly ordered, pandas has to check every key in the index, O(N)."
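A quick sketch to illustrate the quoted behaviour with the question's a and common (timings will vary by machine): sorting the non-unique index should move .loc from the O(N) scan into the O(log N) binary-search regime:

a_sorted = a.sort_index()     # still non-unique, but now sorted
%timeit a_sorted.loc[common]  # expected to be considerably faster than the unsorted case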