efficient loop over numpy array

Versions of this question have already been asked but I have not found a satisfactory answer.

Problem: given a large numpy vector, find the indices of the vector elements which are duplicated (a variation of this could be comparison with a tolerance).

So the problem is ~O(N^2) and memory-bound (at least from the current algorithm's point of view). I wonder why everything I tried in Python is 100x or more slower than equivalent C code.

import numpy as np
N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1  # integer division: N/2 is a float in Python 3 and not a valid index
vect[N // 4] = 1
dupl = []
print("init done")
counter = 0
for i in range(N):
    for j in range(i+1, N):
        if vect[i] == vect[j]:
            dupl.append(j)
            counter += 1

print("counter =", counter)
print(dupl)
# For simplicity, this code ignores repeated indices 
# which can be trimmed later. Ref output is
# counter = 3
# [2500, 5000, 5000]

I tried using numpy iterators but they are even worse (~4-5x slower): http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
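For reference, since the iterator code itself wasn't posted, here is a rough sketch of one such attempt using np.ndindex (my reconstruction, same setup as above). The comparison still runs in the Python interpreter, so the iterator machinery only adds overhead on top of the plain loops:

import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1

# iterate over all (i, j) index pairs and keep only j > i
dupl = []
for i, j in np.ndindex(N, N):
    if j > i and vect[i] == vect[j]:
        dupl.append(j)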

Using N=10,000 I'm getting 0.1 sec in C, 12 sec in Python (code above), 40 sec in Python using np.nditer, 50 sec in Python using np.ndindex. I pushed it to N=160,000 and the timing scales as N^2 as expected.

asked Sep 07 '16 by ink


5 Answers

Since the answers have stopped coming and none was fully satisfactory, I am posting my own solution for the record.

It is my understanding that it is the element-wise assignment that makes Python slow in this case, not the nested loops as I initially thought. Using a library or compiled code eliminates the need for Python-level assignments, and performance improves dramatically.

from __future__ import print_function
import numpy as np
from numba import jit

N = 10000
vect = np.arange(N, dtype=np.float32)

vect[N // 2] = 1  # integer division for a valid index in Python 3
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)

print("init done")
# uncomment to enable compiled function
#@jit
def duplicates(i, counter, dupl, vect):
    eps = 0.01
    ns = len(vect)
    for j in range(i+1, ns):
        # replace if to use approx comparison
        #if abs(vect[i] - vect[j]) < eps:
        if vect[i] == vect[j]:
            dupl[counter] = j
            counter += 1
    return counter

counter = 0
for i in range(N):
    counter = duplicates(i, counter, dupl, vect)

print("counter =", counter)
print(dupl[0:counter])

Tests

# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 10.135 s

# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 0.480 s

The performance of the compiled version (with @jit uncommented) is close to the C code's, ~0.1 - 0.2 sec. Perhaps eliminating the last loop could improve the performance even further. The difference in performance is even stronger when using the approximate eps comparison, while for the compiled version there is very little difference.

# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 109.218 s

# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]

elapsed 0.506 s

This is a ~200x difference. In the real code, I had to put both loops inside the function, as well as use a function template with variable types, so it was a bit more complex, but not by much.
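For illustration, a minimal sketch of that shape (my reconstruction, not the exact real code), with both loops inside the jitted function so no Python-level loop remains:

import numpy as np
from numba import jit

@jit(nopython=True)
def duplicates_all(vect, dupl):
    # both loops are compiled by numba
    counter = 0
    n = vect.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if vect[i] == vect[j]:
                dupl[counter] = j
                counter += 1
    return counter

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)

print("counter =", duplicates_all(vect, dupl))  # counter = 3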

answered by ink


Python itself is a highly dynamic, slow language. The idea in numpy is to use vectorization and avoid explicit loops. In this case, you can use np.equal.outer. You can start with

a = np.equal.outer(vect, vect)

Now, for example, to find the total number of equal (ordered) pairs, including the N diagonal self-matches (10000 + 2*3 = 10006):

 >>> np.sum(a)
 10006

To find the indices whose value has a duplicate, zero out the diagonal and look for columns with any remaining match:

np.fill_diagonal(a, 0)

>>> np.nonzero(np.any(a, axis=0))[0]
array([   1, 2500, 5000])

Timing

def find_vec():
    a = np.equal.outer(vect, vect)
    s = np.sum(a)
    np.fill_diagonal(a, 0)
    return np.sum(a), np.nonzero(np.any(a, axis=0))[0]

>>> %timeit find_vec()
1 loops, best of 3: 214 ms per loop

def find_loop():
    dupl = []
    counter = 0
    for i in range(N):
        for j in range(i + 1, N):
            if vect[i] == vect[j]:
                dupl.append(j)
                counter += 1
    return dupl

>>> %timeit find_loop()
1 loops, best of 3: 8.51 s per loop
answered by Ami Tavory


This solution, using the numpy_indexed package, has O(n log n) complexity and is fully vectorized; so in all likelihood it is not terribly different from C performance.

import numpy as np
import numpy_indexed as npi

# indices of all elements whose value occurs more than once
dpl = np.flatnonzero(npi.multiplicity(vect) > 1)
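For readers without the package, a plain-NumPy sketch of the same O(n log n) idea (my addition, not from the original answer):

import numpy as np

# multiplicity of each element = count of its value in the whole array
_, inverse, counts = np.unique(vect, return_inverse=True, return_counts=True)
dpl = np.flatnonzero(counts[inverse] > 1)  # e.g. array([1, 2500, 5000])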
answered by Eelco Hoogendoorn


The obvious question is why you want to do this in this way. NumPy arrays are intended to be opaque data structures: created inside the NumPy system, with operations sent into the NumPy subsystem to deliver results. In other words, NumPy should be a black box into which you throw requests and out of which come results.

So given the code above I am not at all surprised that the NumPy performance is worse than dreadful.

The following should be effectively what you want, I believe, but done the NumPy way:

import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1

print([np.where(a == vect)[0] for a in vect][1])

# Delivers [1, 2500, 5000]
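Note that the printed line only shows the matches for the value at index 1. A small extension of the same idea (my sketch, not part of the original answer) vectorizes the inner comparison while reproducing the question's reference output:

# continuing from the snippet above: compare each element
# against the tail of the array in one vectorized operation
dupl = []
for i in range(N):
    matches = np.where(vect[i + 1:] == vect[i])[0] + i + 1
    dupl.extend(matches.tolist())

print(dupl)  # [2500, 5000, 5000]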
answered by Russel Winder


Approach #1

You can simulate that iterator dependency criterion (j > i) in a vectorized solution using a triangular matrix. This is based on this post, which dealt with multiplication involving iterator dependency. To perform the elementwise equality check of each element in vect against all of its elements, we can use NumPy broadcasting. Finally, we can use np.count_nonzero to get the count, as it is supposed to be very efficient at summing over boolean arrays.

So, we would have a solution like so -

mask = np.triu(vect[:, None] == vect, 1)
counter = np.count_nonzero(mask)
dupl = np.where(mask)[1]

If you only care about the count counter, there are two more approaches, listed next.

Approach #2

We can avoid the triangular matrix entirely: take the full count of equal pairs, subtract the contribution from the diagonal elements (vect.size self-matches), and halve the remainder, since the lower and upper triangular regions contribute identically. On the example, that is (10006 - 10000)//2 = 3.

So, we would have a modified solution like so -

counter = (np.count_nonzero(vect[:,None] == vect) - vect.size)//2

Approach #3

Here's an entirely different approach, using the fact that each unique value occurring c times contributes c*(c-1)/2 pairs to the final total, a sum that can be built up with a cumsum trick.

So, with that idea in mind, we would have a third approach like so -

count = np.unique(vect, return_counts=True)[1]  # np.bincount(vect) also works, for integer arrays only
idx = count[count > 1]
id_arr = np.ones(idx.sum(), dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1
counter = np.sum(id_arr.cumsum())
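A quick sanity check on the question's example vector (assuming the same vect as in the question):

import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1

count = np.unique(vect, return_counts=True)[1]
idx = count[count > 1]                     # counts of duplicated values: [3]
id_arr = np.ones(idx.sum(), dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1  # restart the 0,1,2,... ramp per group
counter = np.sum(id_arr.cumsum())          # 0 + 1 + 2 = 3 pairs, matching the loop
print(counter)  # 3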
answered by Divakar