Python: faster operation for indexing

Tags: performance, python, indexing, pandas, numpy

I have the following snippet that extracts the indices of every unique (hashable) value in a sequence-like container with canonical integer indices and stores them in a dictionary of lists:

from collections import defaultdict
idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)

This looks like a fairly common use case to me, and it happens that 90% of my code's execution time is spent in these few lines. This part runs over 10,000 times during execution, and len(data) is around 50,000 to 100,000 on each run. The number of unique elements ranges from roughly 50 to 150.

Is there a faster way, perhaps vectorized or C-accelerated (e.g. numpy or pandas methods), to achieve the same thing?

Many many thanks.

asked Jan 06 '16 02:01

Patrick the Cat

2 Answers

Not as impressive as I hoped for originally (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:

import numpy as np
import pandas as pd
from collections import defaultdict

def by_dd(data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

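def by_pand1(data):
    # Same grouping, with each index array converted to a plain list
    return {k: v.tolist() for k, v in data.groupby(data.values).indices.items()}

def by_pand2(data):
    # Returns the dict of NumPy index arrays from pandas as-is
    return data.groupby(data.values).indices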

data = pd.Series(np.random.randint(0, 100, size=10**5))

gives me

>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
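
The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be put together with the standard timeit module. This is only a sketch: the list-based data, the seed, and the repeat counts are my own choices for illustration, not part of the measurements above, and the numbers will differ by machine.

```python
import random
import timeit
from collections import defaultdict

def by_dd(data):
    # Baseline from the question: one pass, appending each index to its group
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

# Assumed test data: 10**5 integers drawn from 100 distinct values
random.seed(0)
data = [random.randrange(100) for _ in range(10**5)]

# Best of 3 repeats, 3 calls per repeat, reported per call
number = 3
best = min(timeit.repeat(lambda: by_dd(data), number=number, repeat=3)) / number
print(f"by_dd: {best * 1e3:.1f} ms per loop")
```

Taking the minimum over the repeats mirrors the "best of 3" that %timeit reports here.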

answered Oct 18 '22 03:10

DSM

Though it's not the perfect solution (it's O(N log N) instead of O(N)), a much faster, vectorized way to do it is:

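import numpy as np

def data_to_idxlists(data):
    data = np.asarray(data)  # positional indexing below needs an ndarray
    # kind="stable" keeps the indices within each group in ascending order
    sorting_ixs = np.argsort(data, kind="stable")
    uniques, unique_indices = np.unique(data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop] for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}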
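
To make the sort-based idea concrete, here is a small worked variant on a six-element array. It is a sketch rather than the exact function above: it uses np.split to cut the sorted index array at the group boundaries, and the name group_indices_sorted is made up for this example.

```python
import numpy as np

def group_indices_sorted(data):
    values = np.asarray(data)
    # Stable sort: equal values keep their original relative order
    order = np.argsort(values, kind="stable")
    # starts[i] is the offset in the sorted array where the i-th unique value begins
    uniques, starts = np.unique(values[order], return_index=True)
    # Cut the sorted index array at every group boundary except the first
    groups = np.split(order, starts[1:])
    return dict(zip(uniques.tolist(), groups))

sample = np.array([3, 1, 3, 2, 1, 3])
result = group_indices_sorted(sample)
print({k: v.tolist() for k, v in result.items()})
# {1: [1, 4], 2: [3], 3: [0, 2, 5]}
```

The values are index arrays, not lists; call .tolist() on them if plain lists are needed.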

Another solution that is O(N*U), where U is the number of unique groups:

def data_to_idxlists(data):
    uniques, ixs = np.unique(data, return_inverse=True)
    # np.nonzero returns a tuple of arrays; [0] selects the 1-D index array
    return {u: np.nonzero(ixs == i)[0] for i, u in enumerate(uniques)}
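
Before swapping in a vectorized version, it may be worth checking that it produces the same groups as the defaultdict loop from the question. The following is a minimal sketch of such a check for the return_inverse approach; the names baseline and by_inverse, the random test data, and the seed are assumptions made for this example.

```python
import numpy as np
from collections import defaultdict

def baseline(data):
    # Reference implementation from the question
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_inverse(data):
    # inverse[j] is the position of data[j] within uniques
    uniques, inverse = np.unique(np.asarray(data), return_inverse=True)
    # One boolean scan of the whole array per unique value, hence O(N*U)
    return {u: np.flatnonzero(inverse == i) for i, u in enumerate(uniques.tolist())}

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=10_000)

expected = baseline(data.tolist())
got = by_inverse(data)
same = expected.keys() == got.keys() and all(
    got[k].tolist() == expected[k] for k in expected
)
print("results match:", same)
```

With 50 to 150 unique values, as in the question, each call scans the array once per unique value, so it is worth timing this variant against the sort-based one on real data.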

answered Oct 18 '22 04:10

Peter

