I have the following snippet that extracts indices of all unique values (hashable) in a sequence-like data
with canonical indices and store them in a dictionary as lists:
from collections import defaultdict
idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
idx_lists[ele].append(idx)
This looks like to me a quite common use case. And it happens that 90% of the execution time of my code is spent in these few lines. This part is passed through over 10000 times during execution, and len(data)
is around 50000 to 100000 each time this is run. Number of unique elements ranges from 50 to 150 roughly.
Is there a faster way, perhaps vectorized/c-extended (e.g. numpy
or pandas
methods), that achieves the same thing?
Many many thanks.
For index loop is working almost 2 times faster compare to for each loop and for iterator loop.
What are Indexing and Slicing? Indexing: Indexing is used to obtain individual elements. Slicing: Slicing is used to obtain a sequence of elements. Indexing and Slicing can be be done in Python Sequences types like list, string, tuple, range objects.
The Python index() method helps you find the index position of an element or an item in a string of characters or a list of items. It spits out the lowest possible index of the specified element in the list. In case the specified item does not exist in the list, a ValueError is returned.
Not as impressive as I hoped for originally (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:
import numpy as np, pandas as pd
from collections import defaultdict
def by_dd(data):
idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
idx_lists[ele].append(idx)
return idx_lists
def by_pand1(data):
return {k: v.tolist() for k,v in data.groupby(data.values).indices.items()}
def by_pand2(data):
return data.groupby(data.values).indices
data = pd.Series(np.random.randint(0, 100, size=10**5))
gives me
>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
Though it's not the perfect solution (it's O(NlogN) instead of O(N)), a much faster, vectorized way to do it is:
def data_to_idxlists(data):
sorting_ixs = np.argsort(data)
uniques, unique_indices = np.unique(data[sorting_ixs], return_index = True)
return {u: sorting_ixs[start:stop] for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:])+[None])}
Another solution that is O(N*U), (where U is the number of unique groups):
def data_to_idxlists(data):
u, ixs = np.unique(data, return_inverse=True)
return {u: np.nonzero(ixs==i) for i, u in enumerate(u)}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With