I have a (long) array a of a handful of different integers. I would now like to create a dictionary where the keys are the integers and the values are arrays of the indices at which the respective integer occurs in a. This
import numpy
a = numpy.array([1, 1, 5, 5, 1])
u = numpy.unique(a)
d = {val: numpy.where(a==val)[0] for val in u}
print(d)
{1: array([0, 1, 4]), 5: array([2, 3])}
works fine, but it seems rather wasteful to first call unique, followed by a couple of wheres.
np.digitize doesn't seem to be ideal either, as you have to specify the bins in advance.
Any ideas of how to improve the above?
Approach #1
One approach based on sorting would be -
import numpy as np

def group_into_dict(a):
    # Get argsort indices
    sidx = a.argsort()

    # Use argsort indices to sort input array
    sorted_a = a[sidx]

    # Get indices that define the grouping boundaries based on identical elems
    cut_idx = np.flatnonzero(np.r_[True, sorted_a[1:] != sorted_a[:-1], True])

    # Form the final dict by slicing the argsort indices for the values and
    # using the group starts as the keys
    return {sorted_a[i]: sidx[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])}
Sample run -
In [55]: a
Out[55]: array([1, 1, 5, 5, 1])
In [56]: group_into_dict(a)
Out[56]: {1: array([0, 1, 4]), 5: array([2, 3])}
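To make the slicing step more concrete, here are roughly the intermediate values for the sample array; the order of equal elements inside sidx may differ since argsort is not stable by default, but the resulting groups are the same:

a = np.array([1, 1, 5, 5, 1])
sidx = a.argsort()                 # e.g. [0 1 4 2 3] -- indices that would sort a
sorted_a = a[sidx]                 # [1 1 1 5 5]
cut_idx = np.flatnonzero(np.r_[True, sorted_a[1:] != sorted_a[:-1], True])
print(cut_idx)                     # [0 3 5] -- boundaries of the groups of identical elements
print({sorted_a[i]: sidx[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])})
# {1: array([0, 1, 4]), 5: array([2, 3])}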
Timings on an array with 1000000 elements and a varying proportion of unique numbers, comparing the proposed approach against the original one -
# 1/100 unique numbers
In [75]: a = np.random.randint(0,10000,(1000000))
In [76]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 6.62 s per loop
In [77]: %timeit group_into_dict(a)
10 loops, best of 3: 121 ms per loop
# 1/1000 unique numbers
In [78]: a = np.random.randint(0,1000,(1000000))
In [79]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 720 ms per loop
In [80]: %timeit group_into_dict(a)
10 loops, best of 3: 92.1 ms per loop
# 1/10000 unique numbers
In [81]: a = np.random.randint(0,100,(1000000))
In [82]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 120 ms per loop
In [83]: %timeit group_into_dict(a)
10 loops, best of 3: 75 ms per loop
# 1/50000 unique numbers
In [84]: a = np.random.randint(0,20,(1000000))
In [85]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 60.8 ms per loop
In [86]: %timeit group_into_dict(a)
10 loops, best of 3: 60.3 ms per loop
So, if you are dealing with just 20 or fewer unique numbers, stick to the original one or read on; otherwise the sorting-based one seems to work well.
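If you want a single entry point for that rule of thumb, a minimal sketch could look like this (group_indices and the n_distinct_hint parameter are illustrative names, not part of the answer above):

def group_indices(a, n_distinct_hint=None):
    # Per the timings above: with only a handful of distinct values (~20 or fewer)
    # the original unique+where loop is fine; otherwise use the sorting-based version.
    if n_distinct_hint is not None and n_distinct_hint <= 20:
        return {val: np.where(a == val)[0] for val in np.unique(a)}
    return group_into_dict(a)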
Approach #2
A pandas-based one, suited for very few unique numbers -
In [142]: a
Out[142]: array([1, 1, 5, 5, 1])
In [143]: import pandas as pd
In [144]: {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
Out[144]: {1: array([0, 1, 4]), 5: array([2, 3])}
Timings on an array with 1000000 elements and 20 unique elements -
In [146]: a = np.random.randint(0,20,(1000000))
In [147]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 35.6 ms per loop
# Original solution
In [148]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 58 ms per loop
and for fewer unique elements -
In [149]: a = np.random.randint(0,10,(1000000))
In [150]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 25.3 ms per loop
In [151]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 44.9 ms per loop
In [152]: a = np.random.randint(0,5,(1000000))
In [153]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
100 loops, best of 3: 17.9 ms per loop
In [154]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 34.4 ms per loop
How does pandas help here for fewer elements? With the sorting-based approach #1, for the case of 20 unique elements, getting the argsort indices was the bottleneck -
In [164]: a = np.random.randint(0,20,(1000000))
In [165]: %timeit a.argsort()
10 loops, best of 3: 51 ms per loop
Now, the pandas-based function gives us the unique elements, be they negative numbers or anything else, which we simply compare against the elements in the input array without the need for sorting. Let's see the improvement on that side:
In [166]: %timeit pd.Series(a).unique()
100 loops, best of 3: 3.17 ms per loop
Of course, it then still needs to get the np.flatnonzero indices, but even so it remains comparatively more efficient.
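Packaged as a function, the same idea might look like this (a minimal sketch; the name group_into_dict_pandas is just for illustration):

import numpy as np
import pandas as pd

def group_into_dict_pandas(a):
    # pd.Series.unique() is hash-based, so it sidesteps the argsort bottleneck
    # seen above; the per-value work is then just a flatnonzero scan.
    return {u: np.flatnonzero(a == u) for u in pd.Series(a).unique()}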
If it's really just a handful of distincts, it may be worthwhile using argpartition instead of argsort. The downside is that this requires a compression step:
import numpy as np

def f_pp_1(a):
    # find one representative index per distinct value without np.unique
    ws = np.empty(a.max()+1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    # order the representatives by value
    idx = np.argsort(a[unq])
    unq = unq[idx]
    # compress the values to 0..n_distinct-1 so bincount stays small
    ws[a[unq]] = np.arange(len(unq))
    compressed = ws[a]
    # group boundaries from cumulative counts, then argpartition into groups
    counts = np.cumsum(np.bincount(compressed))
    return dict(zip(a[unq], np.split(np.argpartition(a, counts[:-1]), counts[:-1])))
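The first few lines rely on a scatter-assignment trick to get one representative index per distinct value without calling np.unique. A small demonstration, assuming the usual NumPy behaviour that the last write wins for repeated fancy indices:

a = np.array([1, 1, 5, 5, 1])
ws = np.empty(a.max() + 1, int)
rng = np.arange(a.size)
ws[a] = rng                  # one surviving position per value (here the last occurrence)
unq = rng[ws[a] == rng]      # positions that survived: [3 4]
print(a[unq])                # [5 1] -- one representative per distinct value, not yet sorted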
If the unique values themselves are small we can save the compression step:
def f_pp_2(a):
    # works on the values directly; assumes non-negative ints with a small maximum
    bc = np.bincount(a)
    keys, = np.where(bc)
    counts = np.cumsum(bc[keys])
    return dict(zip(keys, np.split(np.argpartition(a, counts[:-1]), counts[:-1])))
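On the question's sample array this gives the expected grouping; note that argpartition only guarantees which indices end up in which group, so the order inside each index array may differ from the where-based result:

a = np.array([1, 1, 5, 5, 1])
print(f_pp_2(a))
# {1: array([0, 1, 4]), 5: array([2, 3])}  (index order within a group may vary)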
data = np.random.randint(0, 10, (5,))[np.random.randint(0, 5, (10000000,))]

sol = f_pp_1(data)
for k, v in sol.items():
    assert np.all(k == data[v])
For small numbers of distincts, where is competitive if we can avoid unique:
def f_OP_plus(a):
    # same representative-index trick as f_pp_1, just to get the uniques
    ws = np.empty(a.max()+1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    idx = np.argsort(a[unq])
    unq = unq[idx]
    return {val: np.where(a==val)[0] for val in unq}
Here are my timings (best of 3, 10 per block) using the same test arrays as @Divakar (randint(0, nd, (ns,)) -- nd, ns = number of distincts, number of samples); a minimal harness reproducing this setup is sketched after the tables below:
nd, ns: 5, 1000000
OP 39.88609421 ms
OP_plus 13.04150990 ms
Divakar_1 44.14700069 ms
Divakar_2 21.64940450 ms
pp_1 33.15216140 ms
pp_2 22.43267260 ms
nd, ns: 10, 1000000
OP 52.33878891 ms
OP_plus 17.14743648 ms
Divakar_1 57.76002519 ms
Divakar_2 30.70066951 ms
pp_1 45.33982391 ms
pp_2 34.71166079 ms
nd, ns: 20, 1000000
OP 67.47841339 ms
OP_plus 26.41335099 ms
Divakar_1 71.37646740 ms
Divakar_2 43.09316459 ms
pp_1 57.16468811 ms
pp_2 45.55416510 ms
nd, ns: 50, 1000000
OP 98.91191521 ms
OP_plus 51.15756912 ms
Divakar_1 72.72288438 ms
Divakar_2 70.31920571 ms
pp_1 63.78925461 ms
pp_2 53.00321991 ms
nd, ns: 100, 1000000
OP 148.17743159 ms
OP_plus 92.62091429 ms
Divakar_1 85.02774101 ms
Divakar_2 116.78823209 ms
pp_1 77.01576019 ms
pp_2 66.70976470 ms
And if we don't use the first nd integers for the uniques but instead draw them randomly from between 0 and 10000:
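(The exact generation isn't shown; one plausible way, following the pattern of the data line above, would be:)

nd, ns = 5, 1000000
a = np.random.randint(0, 10000, (nd,))[np.random.randint(0, nd, (ns,))]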
nd, ns: 5, 1000000
OP 40.11689581 ms
OP_plus 12.99256920 ms
Divakar_1 42.13181480 ms
Divakar_2 21.55767360 ms
pp_1 33.21835019 ms
pp_2 23.46851982 ms
nd, ns: 10, 1000000
OP 52.84317869 ms
OP_plus 17.96655210 ms
Divakar_1 57.74175161 ms
Divakar_2 32.31985010 ms
pp_1 44.79893579 ms
pp_2 33.42640731 ms
nd, ns: 20, 1000000
OP 66.46886449 ms
OP_plus 25.78120639 ms
Divakar_1 66.58960858 ms
Divakar_2 42.47685110 ms
pp_1 53.67698781 ms
pp_2 44.53037870 ms
nd, ns: 50, 1000000
OP 98.95576960 ms
OP_plus 50.79147881 ms
Divakar_1 72.44545210 ms
Divakar_2 70.91441818 ms
pp_1 64.19071071 ms
pp_2 53.36350428 ms
nd, ns: 100, 1000000
OP 145.62422500 ms
OP_plus 90.82918381 ms
Divakar_1 76.92769479 ms
Divakar_2 115.24481240 ms
pp_1 70.85122908 ms
pp_2 58.85340699 ms
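For reference, a minimal timing harness along the lines described above (bench and the function list are illustrative, and the functions defined earlier are assumed to be in scope):

import timeit
import numpy as np

def bench(f, a, repeat=3, number=10):
    # best of `repeat` blocks, `number` calls per block, reported as ms per call
    t = min(timeit.repeat(lambda: f(a), repeat=repeat, number=number))
    return 1000.0 * t / number

nd, ns = 20, 1000000
a = np.random.randint(0, nd, (ns,))
for name, f in [('Divakar_1', group_into_dict), ('OP_plus', f_OP_plus),
                ('pp_1', f_pp_1), ('pp_2', f_pp_2)]:
    print('{:10s} {:8.2f} ms'.format(name, bench(f, a)))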