Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find multiple maximum values in a 2d array fast

The situation is as follows:

I have a 2D numpy array. Its shape is (1002, 1004). Each element contains a value between 0 and Inf. What I now want to do is determine the first 1000 maximum values and store the corresponding indices in to a list named x and a list named y. This is because I want to plot the maximum values and the indices actually correspond to real time x and y position of the value.

What I have so far is:

x = numpy.zeros(500)
y = numpy.zeros(500)

for idx in range(500):
    x[idx] = numpy.unravel_index(full.argmax(), full.shape)[0]
    y[idx] = numpy.unravel_index(full.argmax(), full.shape)[1]
    full[full == full.max()] = 0.

print os.times()

Here full is my 2D numpy array. As can be seen from the for loop, I only determine the first 500 maximum values at the moment. This however already takes about 5 s. For the first 1000 maximum values, the user time should actually be around 0.5 s. I've noticed that a very time consuming part is setting the previous maximum value to 0 each time. How can I speed things up?

Thank you so much!

like image 736
The Dude Avatar asked Dec 29 '13 14:12

The Dude


2 Answers

If you have numpy 1.8, you can use the argpartition function or method. Here's a script that calculates x and y:

import numpy as np

# Create an array to work with.
np.random.seed(123)
full = np.random.randint(1, 99, size=(8, 8))

# Get the indices for the largest `num_largest` values.
num_largest = 8

indices = (-full).argpartition(num_largest, axis=None)[:num_largest]
# OR, if you want to avoid the temporary array created by `-full`:
# indices = full.argpartition(full.size - num_largest, axis=None)[-num_largest:]

x, y = np.unravel_index(indices, full.shape)

print("full:")
print(full)
print("x =", x)
print("y =", y)
print("Largest values:", full[x, y])
print("Compare to:    ", np.sort(full, axis=None)[-num_largest:])

Output:

full:
[[67 93 18 84 58 87 98 97]
 [48 74 33 47 97 26 84 79]
 [37 97 81 69 50 56 68  3]
 [85 40 67 85 48 62 49  8]
 [93 53 98 86 95 28 35 98]
 [77 41  4 70 65 76 35 59]
 [11 23 78 19 16 28 31 53]
 [71 27 81  7 15 76 55 72]]
x = [0 2 4 4 0 1 4 0]
y = [6 1 7 2 7 4 4 1]
Largest values: [98 97 98 98 97 97 95 93]
Compare to:     [93 95 97 97 97 98 98 98]
like image 86
Warren Weckesser Avatar answered Sep 21 '22 22:09

Warren Weckesser


You could loop through the array as @Inspired suggests, but looping through NumPy arrays item-by-item tends to lead to slower-performing code than code which uses NumPy functions since the NumPy functions are written in C/Fortran, while the item-by-item loop tends to use Python functions.

So, although sorting is O(n log n), it may be quicker than a Python-based one-pass O(n) solution. Below np.unique performs the sort:

import numpy as np

def nlargest_indices(arr, n):
    uniques = np.unique(arr)
    threshold = uniques[-n]
    return np.where(arr >= threshold)

full = np.random.random((1002,1004))
x, y = nlargest_indices(full, 10)
print(full[x, y])
print(x)
# [  2   7 217 267 299 683 775 825 853]
print(y)
# [645 621 132 242 556 439 621 884 367]

Here is a timeit benchmark comparing nlargest_indices (above) to

def nlargest_indices_orig(full, n):
    full = full.copy()
    x = np.zeros(n)
    y = np.zeros(n)

    for idx in range(n):
        x[idx] = np.unravel_index(full.argmax(), full.shape)[0]
        y[idx] = np.unravel_index(full.argmax(), full.shape)[1]
        full[full == full.max()] = 0.
    return x, y


In [97]: %timeit nlargest_indices_orig(full, 500)
1 loops, best of 3: 5 s per loop

In [98]: %timeit nlargest_indices(full, 500)
10 loops, best of 3: 133 ms per loop

For timeit purposes I needed to copy the array inside nlargest_indices_orig, lest full get mutated by the timing loop.

Benchmarking the copying operation:

def base(full, n):
    full = full.copy()

In [102]: %timeit base(full, 500)
100 loops, best of 3: 4.11 ms per loop

shows this added about 4ms to the 5s benchmark for nlargest_indices_orig.


Warning: nlargest_indices and nlargest_indices_orig may return different results if arr contains repeated values.

nlargest_indices finds the n largest values in arr and then returns the x and y indices corresponding to the locations of those values.

nlargest_indices_orig finds the n largest values in arr and then returns one x and y index for each large value. If there is more than one x and y corresponding to the same large value, then some locations where large values occur may be missed.

They also return indices in a different order, but I suppose that does not matter for your purpose of plotting.

like image 37
unutbu Avatar answered Sep 21 '22 22:09

unutbu