I need to find the indices of the first less than or equal occurrence of elements of one array in another array. One way that works is this:
import numpy
a = numpy.array([10,7,2,0])
b = numpy.array([10,9,8,7,6,5,4,3,2,1])
indices = [numpy.where(a<=x)[0][0] for x in b]
indices has the value [0, 1, 1, 1, 2, 2, 2, 2, 2, 3], which is what I need. The problem of course, is that python "for" loop is slow and my arrays might have millions of elements. Is there any numpy trick for this? This doesn't work because they arrays are not of the same length:
indices = numpy.where(a<=b) #XXX: raises an exception
Thanks!
This may be a special case, but you should be able to use numpy digitize. The caveat here is the bins must be monotonically decreasing or increasing.
>>> import numpy
>>> a = numpy.array([10,7,2,0])
>>> b = numpy.array([10,9,8,7,6,5,4,3,2,1])
>>> indices = [numpy.where(a<=x)[0][0] for x in b]
[0, 1, 1, 1, 2, 2, 2, 2, 2, 3]
>>> numpy.digitize(b,a)
array([0, 1, 1, 1, 2, 2, 2, 2, 2, 3])
Setup for the timing test:
a = np.arange(50)[::-1]
b = np.random.randint(0,50,1E3)
np.allclose([np.where(a<=x)[0][0] for x in b],np.digitize(b,a))
Out[55]: True
Some timings:
%timeit [np.where(a<=x)[0][0] for x in b]
100 loops, best of 3: 4.97 ms per loop
%timeit np.digitize(b,a)
10000 loops, best of 3: 48.1 µs per loop
Looks like two orders of magnitude speed up, this will depend heavily on the number of bins however. Your timings will vary.
To compare to Jamie's answer I have timed the two following pieces of code. As I mainly wanted to focus on the speed of searchsorted
vs digitize
I pared down Jamie's code a bit. The relevant chunk is here:
a = np.arange(size_a)[::-1]
b = np.random.randint(0, size_a, size_b)
ja = np.take(a, np.searchsorted(a, b, side='right', sorter=a)-1)
#Compare to digitize
if ~np.allclose(ja,np.digitize(b,a)):
print 'Comparison failed'
timing_digitize[num_a,num_b] = timeit.timeit('np.digitize(b,a)',
'import numpy as np; from __main__ import a, b',
number=3)
timing_searchsorted[num_a,num_b] = timeit.timeit('np.take(a, np.searchsorted(a, b, side="right", sorter=a)-1)',
'import numpy as np; from __main__ import a, b',
number=3)
This is a bit beyond my limited matplotlib ability so this is done in DataGraph. I have plotted the logarithmic ratio of timing_digitize/timing_searchsorted
so values greater then zero searchsorted
is faster and values less then zero digitize
is faster. The colors also give relative speeds. For example is shows that in the top right (a = 1E6, b=1E6) digitize
is ~300 times slower then searchsorted
while for smaller sizes digitize
can be up to 10x faster. The black line is roughly the break even point:
Looks like for raw speed searchsorted
is almost always faster for large cases, but the simple syntax of digitize
is nearly as good if the number of bins is small.
This is messy, but it works:
>>> idx = np.argsort(a)
>>> np.take(idx, np.searchsorted(a, b, side='right', sorter=idx)-1)
array([0, 1, 1, 1, 2, 2, 2, 2, 2, 3], dtype=int64)
If your array is always sorted, you should be able to get rid of the argsort
call.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With