 

Fast algorithm to find indices where multiple arrays have the same value

I'm looking for ways to speed up (or replace) my algorithm for grouping data.

I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)

This sounds kind of awkward, so have an example:

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *

Note that the elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays hold the same values at both positions (namely 10 and 21). Similarly for the elements at indices 3 and 5 (both 3, from the combination (11, 22)).

The algorithm has to deal with an arbitrary number of (equally-sized) input arrays, and also return, for each resulting group ID, which values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)
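
For the example, the mapping from group IDs back to the original value combinations would look something like this (a sketch; the exact container is up to the implementation):

# Hypothetical mapping from group ID to (first array value, second array value):
expected_meanings = {
    0: (10, 21),
    1: (11, 21),
    2: (10, 22),
    3: (11, 22),
    4: (10, 23),
}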

Here is my current algorithm:

import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1 # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
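
For reference, running this on the test values yields the expected grouping, with the group meanings keyed by array index (output formatted for readability):

group, meanings = groupify(values)
print(group)
# [0 1 2 3 0 3 4]
print(meanings)
# {0: {0: 10, 1: 21}, 1: {0: 11, 1: 21}, 2: {0: 10, 1: 22},
#  3: {0: 11, 1: 22}, 4: {0: 10, 1: 23}}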

Note that the expression value_array[matching] == needed_value is re-evaluated over the remaining indices on every pass of the while loop, so each individual index is scanned many times; that is where the slowness comes from.

I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?

asked Jun 23 '16 by acdr

3 Answers

Cracked it finally with a vectorized solution! It was an interesting problem. We have to tag each pair of values taken from the corresponding array elements of the list, and then tag each such pair based on its uniqueness among the other pairs. So, we can use np.unique, leaning on all of its optional arguments, and finally do some additional work to keep first-appearance order in the final output. Here's the implementation, basically done in three stages -

# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)

# Do the heavy work with np.unique to give us :
#   1. Starting indices of unique elems, 
#   2. Array that has unique IDs for each element in idx, and 
#   3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                        return_inverse=True,return_counts=True)

# Best part happens here : Use mask to ignore the repeated elems and re-tag 
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
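
To see the stages in action, here is a trace on the test values from the question (intermediate values written out by hand, so treat them as a sketch; note that ravel_multi_index assumes non-negative integer inputs, since each column is treated as an indexing tuple):

arr = np.vstack(values)                        # shape (2, 7)
idx = np.ravel_multi_index(arr, arr.max(1)+1)  # dims here: (12, 24)
# idx   -> [261 285 262 286 261 286 263]
# unqID -> [0 3 1 4 0 4 2]   (IDs follow the *sorted* order of idx)
# The argsort re-tagging then renumbers groups by first appearance:
# out   -> [0 1 2 3 0 3 4]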

Runtime test

Let's compare the proposed vectorized approach against the original code. Since the proposed code gets us the group IDs only, for a fair benchmark let's trim off the parts of the original code that aren't needed for that. So, here are the function definitions -

def groupify(values):  # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values):  # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr,arr.max(1)+1)
    _,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                        return_inverse=True,return_counts=True)    
    mask = ~np.in1d(unqID,np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]

Runtime results on a list with large arrays -

In [345]: # Input list with random elements
     ...: values = [item for item in np.random.randint(10,40,(10,10000))]

In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True

In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop

In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
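
The proposed code returns only the group IDs, while the question also asks for the original values behind each ID. One way to recover them, reusing the intermediate arrays (arr, idx, mask, unq_start_idx) from the first code block above (my sketch, not part of the benchmarked answer):

# Map each final group ID back to a column of the stacked array
perm = idx[mask].argsort()   # sorted-unique index -> final group ID
inv = perm.argsort()         # final group ID -> sorted-unique index
group_meanings = {g: tuple(arr[:, unq_start_idx[inv[g]]])
                  for g in range(len(unq_start_idx))}
# e.g. group_meanings[3] -> (11, 22)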
answered Nov 14 '22 by Divakar


This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:

import numpy as np

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])

# for every value in values, check where duplicate values occur
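# (each resulting mask has shape (n, n); entry (i, j) is True iff val[i] == val[j])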
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]

# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)

# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False

# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)

# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]):  # this element has not been labelled yet
        labelled[k] = val      # so label it
        labelled[row] = val    # and label every element satisfying the test
        val += 1

print(labelled)
# outputs [ 0.  1.  2.  3.  0.  3.  4.]

It is about 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup will be greater for more arrays.

answered Nov 14 '22 by EelkeSpaak


The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array-set operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner:

import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)

The above will work for arbitrary type combinations; alternatively, the following will be even more efficient if values is a list of many arrays of the same dtype:

unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)
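
A quick check on the test values (note that the label numbering may differ from the question's expected output, since the groups may come out in sorted-key rather than first-appearance order; only the grouping itself matters):

import numpy as np
import numpy_indexed as npi

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

unique_values, labels = npi.unique(tuple(values), return_inverse=True)
# labels places indices 0 and 4 in one group and 3 and 5 in another,
# matching the expected grouping up to relabelling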
answered Nov 14 '22 by Eelco Hoogendoorn