I have an array a of values whose values at each index idx I'd like to repeat a certain amount b[idx] of times, given at the same index idx in another array (b), like such:
a = numpy.array([1, 2, 3 ,4, 5])
b = numpy.array([2, 3, 1, 2, 4])
Desired output:
c = numpy.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5])
I realize I could do something like this:
len_a = numpy.shape(a)[0]
sum_b = sum(b)
c = numpy.zeros((1, 0))
for idx in range(len_a):
repeated_a = numpy.repeat(a[idx], b[idx])
repeated_a = numpy.reshape(repeated_a, (1, numpy.shape(repeated_a)[0]))
c = numpy.hstack((c, repeated_a))
However, looping is not a good option because it is slow. How would I go about making this faster? Some form of vectorization perhaps.
You are looking for a built-in repeat function made for this purpose. Simply feed both arrays into the function:
np.repeat(a,b)
#[1 1 2 2 2 3 4 4 5 5 5 5]
Depending on what you know about your two arrays, you might be able to squeeze out more performance than what the native gives you: Do you know what kind of numbers can appear in the two arrays? Do you know how long the resulting array will be, so you can take advantage of that and initialize that memory up front?
To illustrate what kind of difference this makes, let's just look at what you can squeeze out of Numba with no further tricks:
from numba import njit
import numpy as np
@njit
def f(a, b):
s = b.sum()
res = np.empty(s, dtype=np.int64)
cs = 0
for i in range(len(a)):
bi = b[i]
res[cs:cs+bi] = a[i]
cs += bi
return res
Let's use IPython's magic %timeit to see how this compares to np.repeat(a, b):
In [2]: a = np.random.randint(1, 6, 10000)
...: b = np.random.randint(1, 6, 10000)
In [3]: %timeit np.repeat(a, b)
135 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: %timeit f(a, b)
87 µs ± 485 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So that's already a good improvement in itself. Now if you know what b.sum() is beforehand, you could avoid calculating it explicitly and thus save a bit of time. Or if you know how the values of b are distributed, you can figure out an upper bound of the array size and truncate res[:cs] at the end of the day.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With