
Classify elements of a numpy array using a second array as reference

Let's say I have an array with a finite number of unique values. Say

data = array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

And I also have a reference array with all the unique values found in data, without repetitions and in a particular order. Say

reference = array([20, 10, 30])

I want to create an array with the same shape as data whose values are the indices in the reference array at which each element of data is found.

In other words, having data and reference, I want to create an array indexes such that the following holds.

data == reference[indexes]   # True at every position
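
For the sample arrays above, the target can be written out by hand. The following is only a small sketch of the condition being asked for; the name rebuilt is introduced here for illustration and is not part of the original question.

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

# Hand-written target: position of each data value inside reference
# (20 -> 0, 10 -> 1, 30 -> 2).
indexes = np.array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])

# Integer-array indexing picks reference[indexes[i]] for every i,
# so the result has the same shape as indexes (and therefore as data).
rebuilt = reference[indexes]
print(np.array_equal(rebuilt, data))  # True
```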

A suboptimal way to compute indexes is a for loop, like this

indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    # np.where returns a tuple of index arrays; take the first match as a scalar
    indexes[i] = np.where(data[i] == reference)[0][0]

but I'd be surprised if there weren't a numpythonic (and thus faster!) way to do this... Any ideas?

Thanks!

mgab asked Jun 26 '15 16:06


2 Answers

We have data and reference as -

In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

In [376]: reference
Out[376]: array([20, 10, 30])

For a moment, let us consider a sorted version of reference -

In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])

Now, we can use np.searchsorted to find out the position of each data element in this sorted version, like so -

In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)

If we run the original code, the expected output turns out to be -

In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])

As can be seen, the searchsorted output is correct except that its 0s must become 1s and its 1s must become 0s. That is because we searched the sorted version of reference, so each position refers to the sorted order rather than to reference itself. To translate the positions back, we index the sorting indices, np.argsort(reference), with them. That's basically it for a vectorized approach with no loop and no dict! The final implementation looks like this -

# Get sorting indices for reference
sort_idx = np.argsort(reference)

# Sort reference and get searchsorted indices for data in reference
pos = np.searchsorted(reference[sort_idx], data, side='left')

# Map positions in the sorted reference back to positions in reference,
# reusing sort_idx instead of calling argsort a second time
out = sort_idx[pos]
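
To make the remapping step concrete, here is a small trace of the three steps on the sample arrays, printing each intermediate result. It is only a sketch of the steps above; the variable sorted_ref is a name added here for readability.

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

# Step 1: permutation that sorts reference ([20, 10, 30] -> [10, 20, 30])
sort_idx = np.argsort(reference)
sorted_ref = reference[sort_idx]
print(sort_idx.tolist())    # [1, 0, 2]
print(sorted_ref.tolist())  # [10, 20, 30]

# Step 2: position of each data value inside the sorted copy
pos = np.searchsorted(sorted_ref, data, side='left')
print(pos.tolist())         # [2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2]

# Step 3: sort_idx[k] is where the k-th smallest value sits in reference,
# so indexing sort_idx with pos converts sorted positions to original ones
out = sort_idx[pos]
print(out.tolist())         # [2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2]
```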

Runtime tests -

In [396]: data = np.random.randint(0,30000,150000)
     ...: reference = np.unique(data)
     ...: reference = reference[np.random.permutation(reference.size)]
     ...: 
     ...: 
     ...: def org_approach(data,reference):
     ...:     indexes = np.zeros_like(data, dtype=int)
     ...:     for i in range(data.size):
     ...:         indexes[i] = np.where(data[i] == reference)[0]
     ...:     return indexes
     ...: 
     ...: def vect_approach(data,reference):
     ...:     sort_idx = np.argsort(reference)
     ...:     pos = np.searchsorted(reference[sort_idx], data, side='left')       
     ...:     return sort_idx[pos]
     ...: 

In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop

In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop

Verify results -

In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True
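
One caveat worth sketching: the searchsorted approach relies on every value in data being present in reference, as the question states. If that might not hold, a guarded variant along these lines could catch it. The function name checked_lookup and the choice to raise ValueError are assumptions made for this example, and it assumes reference is non-empty.

```python
import numpy as np

def checked_lookup(data, reference):
    """Hypothetical helper: like vect_approach, but rejects unknown values."""
    sort_idx = np.argsort(reference)
    sorted_ref = reference[sort_idx]
    pos = np.searchsorted(sorted_ref, data, side='left')
    # searchsorted returns len(sorted_ref) for values above the maximum;
    # clip so the comparison below cannot index out of bounds.
    pos = np.clip(pos, 0, sorted_ref.size - 1)
    # A value is present only if the element at its insertion point equals it.
    found = sorted_ref[pos] == data
    if not found.all():
        missing = np.unique(data[~found])
        raise ValueError(f"values not in reference: {missing.tolist()}")
    return sort_idx[pos]

reference = np.array([20, 10, 30])
print(checked_lookup(np.array([30, 10, 20]), reference).tolist())  # [2, 1, 0]
```

Without the check, a value missing from reference would either silently receive the index of a neighbouring value or, if it exceeds the maximum, cause an IndexError.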

Divakar answered Oct 17 '22 17:10


You have to loop through the data once to map the data values onto indexes. The quickest way to do that is to look up the value indexes in a dictionary. So you need to create a dictionary from values to indexes first.

Here's a complete example:

import numpy

data = numpy.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = numpy.array([20, 10, 30])
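# Map each reference value to its position in reference
reference_index = {value: index for index, value in enumerate(reference)}
# Look up every data value; wrap in an array so the result matches data's shape
indexes = numpy.array([reference_index[value] for value in data])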
assert numpy.all(data == reference[indexes])

This will be faster than the numpy.where approach because numpy.where does a linear search over reference for every element of data, which is O(n) in the size of reference, while the dictionary approach uses a hash table to find each index in O(1) time on average.
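
A rough way to check this claim on your own machine is sketched below. The function names where_loop and dict_lookup, the array sizes, and the use of tolist() to work with plain Python integers are choices made for this example rather than part of the answer above, and the timings will vary by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 300, size=3000)
reference = rng.permutation(np.unique(data))  # unique values in shuffled order

def where_loop(data, reference):
    # Linear scan of reference for every element of data
    out = np.zeros(data.shape, dtype=int)
    for i in range(data.size):
        out[i] = np.where(data[i] == reference)[0][0]
    return out

def dict_lookup(data, reference):
    # One pass to build the hash table, one pass to look up every element
    table = {value: index for index, value in enumerate(reference.tolist())}
    return np.array([table[value] for value in data.tolist()])

# Both approaches should produce identical indices
assert np.array_equal(where_loop(data, reference), dict_lookup(data, reference))

for fn in (where_loop, dict_lookup):
    seconds = timeit.timeit(lambda: fn(data, reference), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```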

Daniel Renshaw answered Oct 17 '22 19:10