Python: faster operation for indexing

I have the following snippet that extracts the indices of all unique values (hashable) in sequence-like data with canonical indices and stores them in a dictionary as lists:

from collections import defaultdict
idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    # record every position at which each distinct value occurs
    idx_lists[ele].append(idx)

This looks to me like a quite common use case, and it turns out that 90% of my code's execution time is spent in these few lines. This part is executed over 10,000 times, and len(data) is around 50,000 to 100,000 on each run. The number of unique elements ranges roughly from 50 to 150.

Is there a faster way, perhaps vectorized/C-extended (e.g. numpy or pandas methods), that achieves the same thing?

Many many thanks.

asked Jan 06 '16 by Patrick the Cat



2 Answers

Not as impressive as I originally hoped (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:

import numpy as np, pandas as pd
from collections import defaultdict

def by_dd(data):
    # the original pure-Python baseline
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(data):
    # pandas groupby; convert each index array to a plain list
    return {k: v.tolist() for k, v in data.groupby(data.values).indices.items()}

def by_pand2(data):
    # pandas groupby; keep the raw ndarray of positions for each value
    return data.groupby(data.values).indices

data = pd.Series(np.random.randint(0, 100, size=10**5))

gives me

>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
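
As a quick sanity check (my addition, not part of the original answer), the three variants can be compared on a small input; note that by_pand2 returns NumPy arrays rather than lists, so its values need converting first:

# hypothetical sanity check, not from the original answer
check = pd.Series([3, 1, 3, 2, 1])
assert dict(by_dd(check)) == by_pand1(check)
assert {k: v.tolist() for k, v in by_pand2(check).items()} == by_pand1(check)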
answered Oct 18 '22 by DSM

Though it's not the perfect solution (it's O(N log N) instead of O(N)), a much faster, vectorized way to do it is:

def data_to_idxlists(data):
    # sort once, then locate where each run of equal values starts
    sorting_ixs = np.argsort(data)
    uniques, unique_indices = np.unique(data[sorting_ixs], return_index=True)
    bounds = list(unique_indices[1:]) + [None]
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, bounds)}
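
For example (my own illustration, not from the original answer; within each group the indices come back in argsort order, which for ties depends on the sort algorithm):

# hypothetical usage sketch
data = np.array([3, 1, 3, 2, 1])
print(data_to_idxlists(data))
# e.g. {1: array([1, 4]), 2: array([3]), 3: array([0, 2])}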

Another solution, which is O(N*U) (where U is the number of unique values):

def data_to_idxlists(data):
    # map each element to its group id, then collect the positions per group
    uniques, ixs = np.unique(data, return_inverse=True)
    # np.nonzero returns a 1-tuple of arrays, so take element [0]
    return {u: np.nonzero(ixs == i)[0] for i, u in enumerate(uniques)}
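
The same check (again my addition, using a hypothetical small input) shows this version returns the positions in ascending order, since np.nonzero scans left to right:

# hypothetical usage sketch
data = np.array([3, 1, 3, 2, 1])
print(data_to_idxlists(data))
# {1: array([1, 4]), 2: array([3]), 3: array([0, 2])}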
answered Oct 18 '22 by Peter