
Create index dictionary from integer list

Tags: python, numpy

I have a (long) array a containing just a handful of different integers. I would now like to create a dictionary where the keys are the integers and the values are arrays of the indices at which the respective integer occurs in a. This

import numpy

a = numpy.array([1, 1, 5, 5, 1])
u = numpy.unique(a)
d = {val: numpy.where(a==val)[0] for val in u}

print(d)
{1: array([0, 1, 4]), 5: array([2, 3])}

works fine, but it seems rather wasteful to first call unique and then do one where per unique value.

np.digitize doesn't seem to be ideal either, as you have to specify the bins in advance.
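(For reference, a digitize-based variant would look something like the sketch below; since the bins have to be the unique values, the unique pass is not avoided, which is why it buys nothing here.)

import numpy

a = numpy.array([1, 1, 5, 5, 1])

# digitize needs the bins -- here the unique values -- in advance,
# so the call to unique is not avoided
bins = numpy.unique(a)
inds = numpy.digitize(a, bins, right=True)  # array([0, 0, 1, 1, 0])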

Any ideas of how to improve the above?

asked Feb 04 '23 by Nico Schlömer


2 Answers

Approach #1

One approach based on sorting would be -

import numpy as np

def group_into_dict(a):
    # Get argsort indices
    sidx = a.argsort()

    # Use argsort indices to sort input array
    sorted_a = a[sidx]

    # Get the group boundaries: padding the elementwise comparison with
    # True on both sides marks position 0 and the one-past-the-end
    # position as boundaries as well
    cut_idx = np.flatnonzero(np.r_[True, sorted_a[1:] != sorted_a[:-1], True])

    # Form the final dict, slicing the argsort indices for the values
    # and using the element at each group start as the key
    return {sorted_a[i]: sidx[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])}

Sample run -

In [55]: a
Out[55]: array([1, 1, 5, 5, 1])

In [56]: group_into_dict(a)
Out[56]: {1: array([0, 1, 4]), 5: array([2, 3])}

Timings on an array with 1000000 elements and a varying proportion of unique numbers, comparing the proposed approach against the original one -

# 1/100 unique numbers
In [75]: a = np.random.randint(0,10000,(1000000))

In [76]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 6.62 s per loop

In [77]: %timeit group_into_dict(a)
10 loops, best of 3: 121 ms per loop

# 1/1000 unique numbers
In [78]: a = np.random.randint(0,1000,(1000000))

In [79]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 720 ms per loop

In [80]: %timeit group_into_dict(a)
10 loops, best of 3: 92.1 ms per loop

# 1/10000 unique numbers
In [81]: a = np.random.randint(0,100,(1000000))

In [82]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 120 ms per loop

In [83]: %timeit group_into_dict(a)
10 loops, best of 3: 75 ms per loop

# 1/50000 unique numbers
In [84]: a = np.random.randint(0,20,(1000000))

In [85]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 60.8 ms per loop

In [86]: %timeit group_into_dict(a)
10 loops, best of 3: 60.3 ms per loop

So, if you are dealing with just 20 or fewer unique numbers, stick to the original approach; otherwise the sorting-based one seems to work well.
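(A hypothetical way to combine the two regimes: estimate the number of distinct values from a small sample and dispatch accordingly. The helper below is only a sketch; the threshold of 20 comes from the timings above and the sample size is a rough guess, so both would need tuning for your data.)

def group_into_dict_auto(a, sample_size=1000, threshold=20):
    # Rough estimate of the number of distinct values; assumes the
    # leading sample is reasonably representative of the whole array
    n_est = len(np.unique(a[:sample_size]))
    if n_est <= threshold:
        # Few distinct values: one boolean scan per value is fine
        return {val: np.where(a == val)[0] for val in np.unique(a)}
    # Many distinct values: the sorting-based grouping wins
    return group_into_dict(a)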


Approach #2

A pandas-based one, suited for very few unique numbers -

In [142]: a
Out[142]: array([1, 1, 5, 5, 1])

In [143]: import pandas as pd

In [144]: {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
Out[144]: {1: array([0, 1, 4]), 5: array([2, 3])}

Timings on an array with 1000000 elements and 20 unique elements -

In [146]: a = np.random.randint(0,20,(1000000))

In [147]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 35.6 ms per loop

# Original solution
In [148]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 58 ms per loop

and for fewer unique elements -

In [149]: a = np.random.randint(0,10,(1000000))

In [150]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 25.3 ms per loop

In [151]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 44.9 ms per loop

In [152]: a = np.random.randint(0,5,(1000000))

In [153]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
100 loops, best of 3: 17.9 ms per loop

In [154]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 34.4 ms per loop

How does pandas help here for fewer unique elements?

With the sorting-based approach #1, for the case of 20 unique elements, getting the argsort indices was the bottleneck -

In [164]: a = np.random.randint(0,20,(1000000))

In [165]: %timeit a.argsort()
10 loops, best of 3: 51 ms per loop

Now, the pandas-based function gets us the unique elements via hashing, be they negative numbers or anything else, and we then simply compare each of them against the elements of the input array, with no sorting needed. Let's see the improvement on that side:

In [166]: %timeit pd.Series(a).unique()
100 loops, best of 3: 3.17 ms per loop

Of course, it then still needs one np.flatnonzero pass per unique value, but with so few unique values that still keeps it comparatively more efficient.
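As an aside (not timed here), pandas' groupby machinery can also build the whole dictionary in one call: GroupBy.indices returns exactly a mapping from group key to the positional indices of its members.

import numpy as np
import pandas as pd

a = np.array([1, 1, 5, 5, 1])

# .indices maps each group key to the positions of its members;
# note that the keys come back as numpy scalars
d = pd.Series(a).groupby(a).indices
# {1: array([0, 1, 4]), 5: array([2, 3])}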

answered Feb 06 '23 by Divakar


If it's really just a handful of distinct values, it may be worthwhile to use argpartition instead of argsort. The downside is that this requires a compression step:

import numpy as np

def f_pp_1(a):
    # Find one representative index per distinct value: after the
    # scatter ws[a] = rng, ws[a[i]] == i holds only at the last
    # occurrence of each value
    ws = np.empty(a.max()+1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    # Order the representatives by their values
    idx = np.argsort(a[unq])
    unq = unq[idx]
    # Compression step: relabel the distinct values as 0..n_distinct-1
    ws[a[unq]] = np.arange(len(unq))
    compressed = ws[a]
    # Cumulative counts give the split points between groups
    counts = np.cumsum(np.bincount(compressed))
    # argpartition groups equal values together without a full sort
    return dict(zip(a[unq], np.split(np.argpartition(a, counts[:-1]), counts[:-1])))

If the unique values are small, we can save ourselves the compression step:

def f_pp_2(a):
    # bincount over the raw values; the nonzero entries mark the
    # distinct values
    bc = np.bincount(a)
    keys, = np.where(bc)
    # Cumulative counts give the split points between groups
    counts = np.cumsum(bc[keys])
    return dict(zip(keys, np.split(np.argpartition(a, counts[:-1]), counts[:-1])))

data = np.random.randint(0, 10, (5,))[np.random.randint(0, 5, (10000000,))]


sol = f_pp_1(data)
# Sanity check: every index stored under a key must point at that value
for k, v in sol.items():
    assert np.all(k == data[v])

For small numbers of distinct values, where is competitive if we can avoid unique:

def f_OP_plus(a):
    # Same trick as in f_pp_1: after the scatter ws[a] = rng, the
    # positions where ws[a] == rng are the last occurrences of each
    # distinct value, so a[unq] recovers the unique values without
    # calling np.unique
    ws = np.empty(a.max()+1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    idx = np.argsort(a[unq])
    unq = unq[idx]
    return {val: np.where(a == val)[0] for val in a[unq]}
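The same sanity check as above can be run over the remaining variants (a quick sketch, assuming data is still in scope):

for f in (f_pp_2, f_OP_plus):
    sol = f(data)
    for k, v in sol.items():
        assert np.all(k == data[v])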

Here are my timings (best of 3, 10 per block) using the same test arrays as @Divakar (randint(0, nd, (ns,)), where nd, ns = number of distinct values, number of samples):

nd, ns: 5, 1000000
OP                   39.88609421 ms
OP_plus              13.04150990 ms
Divakar_1            44.14700069 ms
Divakar_2            21.64940450 ms
pp_1                 33.15216140 ms
pp_2                 22.43267260 ms
nd, ns: 10, 1000000
OP                   52.33878891 ms
OP_plus              17.14743648 ms
Divakar_1            57.76002519 ms
Divakar_2            30.70066951 ms
pp_1                 45.33982391 ms
pp_2                 34.71166079 ms
nd, ns: 20, 1000000
OP                   67.47841339 ms
OP_plus              26.41335099 ms
Divakar_1            71.37646740 ms
Divakar_2            43.09316459 ms
pp_1                 57.16468811 ms
pp_2                 45.55416510 ms
nd, ns: 50, 1000000
OP                   98.91191521 ms
OP_plus              51.15756912 ms
Divakar_1            72.72288438 ms
Divakar_2            70.31920571 ms
pp_1                 63.78925461 ms
pp_2                 53.00321991 ms
nd, ns: 100, 1000000
OP                  148.17743159 ms
OP_plus              92.62091429 ms
Divakar_1            85.02774101 ms
Divakar_2           116.78823209 ms
pp_1                 77.01576019 ms
pp_2                 66.70976470 ms

And if we don't use the first nd integers for uniques but instead draw them randomly from between 0 and 10000:

nd, ns: 5, 1000000
OP                   40.11689581 ms
OP_plus              12.99256920 ms
Divakar_1            42.13181480 ms
Divakar_2            21.55767360 ms
pp_1                 33.21835019 ms
pp_2                 23.46851982 ms
nd, ns: 10, 1000000
OP                   52.84317869 ms
OP_plus              17.96655210 ms
Divakar_1            57.74175161 ms
Divakar_2            32.31985010 ms
pp_1                 44.79893579 ms
pp_2                 33.42640731 ms
nd, ns: 20, 1000000
OP                   66.46886449 ms
OP_plus              25.78120639 ms
Divakar_1            66.58960858 ms
Divakar_2            42.47685110 ms
pp_1                 53.67698781 ms
pp_2                 44.53037870 ms
nd, ns: 50, 1000000
OP                   98.95576960 ms
OP_plus              50.79147881 ms
Divakar_1            72.44545210 ms
Divakar_2            70.91441818 ms
pp_1                 64.19071071 ms
pp_2                 53.36350428 ms
nd, ns: 100, 1000000
OP                  145.62422500 ms
OP_plus              90.82918381 ms
Divakar_1            76.92769479 ms
Divakar_2           115.24481240 ms
pp_1                 70.85122908 ms
pp_2                 58.85340699 ms
answered Feb 06 '23 by Paul Panzer