Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numpy Array: Efficiently find matching indices

I have two lists, one of which is massive (millions of elements), the other several thousand. I want to do the following

bigArray=[0,1,0,2,3,2,,.....]

smallArray=[0,1,2,3,4]

for i in len(smallArray):
  pts=np.where(bigArray==smallArray[i])
  #Do stuff with pts...

The above works, but is slow. Is there any way to do this more efficiently without resorting to writing something in C?

like image 906
user1356855 Avatar asked Apr 25 '12 17:04

user1356855


People also ask

Is NumPy indexing fast?

Indexing in NumPy is a reasonably fast operation.

Does NumPy do lazy evaluation?

WeldNumpy is a Weld-enabled library that provides a subclass of NumPy's ndarray module, called weldarray, which supports automatic parallelization, lazy evaluation, and various other optimizations for data science workloads.

How do I compare values in two NumPy arrays?

Method 1: We generally use the == operator to compare two NumPy arrays to generate a new array object. Call ndarray. all() with the new array object as ndarray to return True if the two NumPy arrays are equivalent.

How do you find the index of an element of a NumPy array?

Using ndenumerate() function to find the Index of value It is usually used to find the first occurrence of the element in the given numpy array.


1 Answers

In your case you may benefit from presorting your big array. Here is the example demonstrating how you can reduce the time from ~ 45 seconds to 2 seconds (on my laptop)(for one particular set of lengths of the arrays 5e6 vs 1e3). Obviously the solution won't be optimal if the array sizes will be wastly different. E.g. with the default solution the complexity is O(bigN*smallN), but for my suggested solution it is O((bigN+smallN)*log(bigN))

import numpy as np, numpy.random as nprand, time, bisect

bigN = 5e6
smallN = 1000
maxn = 1e7
nprand.seed(1)  
bigArr = nprand.randint(0, maxn, size=bigN)
smallArr = nprand.randint(0, maxn, size=smallN)

# brute force 
t1 = time.time()
for i in range(len(smallArr)):
    inds = np.where(bigArr == smallArr[i])[0]
t2 = time.time()
print "Brute", t2-t1

# not brute force (like nested loop with index scan)
t1 = time.time()
sortedind = np.argsort(bigArr)
sortedbigArr = bigArr[sortedind]
for i in range(len(smallArr)):
    i1 = bisect.bisect_left(sortedbigArr, smallArr[i])
    i2 = bisect.bisect_right(sortedbigArr, smallArr[i])
    inds = sortedind[i1:i2]
t2=time.time()
print "Non-brute", t2-t1

Output:

Brute 42.5278530121

Non-brute 1.57193303108

like image 95
sega_sai Avatar answered Sep 30 '22 13:09

sega_sai