Let's say I have an array with a finite number of unique values. Say

data = array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

And I also have a reference array with all the unique values found in data, without repetitions and in a particular order. Say

reference = array([20, 10, 30])

And I want to create an array with the same shape as data containing, as values, the indices in the reference array where each element of data is found.
In other words, having data and reference, I want to create an array indexes such that the following holds:

data = reference[indexes]
A suboptimal approach to compute indexes would be using a for loop, like this:

indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = np.where(data[i] == reference)[0]
but I'd be surprised if there weren't a numpythonic (and thus faster!) way to do this... Any ideas?
Thanks!
We have data and reference as -
In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
In [376]: reference
Out[376]: array([20, 10, 30])
For a moment, let us consider a sorted version of reference -
In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])
Now, we can use np.searchsorted to find the position of each data element in this sorted version, like so -
In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)
If we run the original code, the expected output turns out to be -
In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])
As can be seen, the searchsorted output is almost right, except that every 0 in it should be a 1 and vice versa. That's because the search was done against the sorted version of reference, so the positions it returns refer to the sorted array, not to reference itself. To map them back to positions in the original reference, we bring in the indices used for sorting reference, i.e. np.argsort(reference). That's basically it for a vectorized, loop-free and dict-free approach! So, the final implementation would look something like this -
# Get sorting indices for reference
sort_idx = np.argsort(reference)
# Sort reference and get searchsorted indices for data in reference
pos = np.searchsorted(reference[sort_idx], data, side='left')
# Map positions in the sorted array back to indices in reference
out = sort_idx[pos]
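For a quick sanity check, here is that snippet run end-to-end on the example arrays from the question (a self-contained sketch):

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

sort_idx = np.argsort(reference)  # indices that sort reference: [1, 0, 2]
pos = np.searchsorted(reference[sort_idx], data, side='left')
out = sort_idx[pos]

# The defining property from the question holds:
assert np.array_equal(reference[out], data)
```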
Runtime tests -
In [396]: data = np.random.randint(0,30000,150000)
...: reference = np.unique(data)
...: reference = reference[np.random.permutation(reference.size)]
...:
...:
...: def org_approach(data,reference):
...: indexes = np.zeros_like(data, dtype=int)
...: for i in range(data.size):
...: indexes[i] = np.where(data[i] == reference)[0]
...: return indexes
...:
...: def vect_approach(data,reference):
...: sort_idx = np.argsort(reference)
...: pos = np.searchsorted(reference[sort_idx], data, side='left')
...: return sort_idx[pos]
...:
In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop
In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop
Verify results -
In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True
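As a small aside, np.searchsorted also accepts a sorter argument, which makes it search reference as if it were sorted by sort_idx, so the sorted copy reference[sort_idx] never has to be materialized. A sketch of that variation:

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

sort_idx = np.argsort(reference)
# sorter=sort_idx: binary-search reference in the order given by sort_idx
pos = np.searchsorted(reference, data, sorter=sort_idx, side='left')
indexes = sort_idx[pos]

assert np.array_equal(reference[indexes], data)
```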
You have to loop through the data once to map the data values onto indexes. The quickest way to do that is to look up the value indexes in a dictionary. So you need to create a dictionary from values to indexes first.
Here's a complete example:
import numpy
data = numpy.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = numpy.array([20, 10, 30])
reference_index = dict((value, index) for index, value in enumerate(reference))
indexes = [reference_index[value] for value in data]
assert numpy.all(data == reference[indexes])
This will be faster than the numpy.where approach, because numpy.where does a linear O(n) scan of reference for each element, while the dictionary approach uses a hash table to find each index in O(1) time.
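If the values are known to be small non-negative integers, as in this example, a dense lookup array is another option: it trades the hash table for a single vectorized fancy-indexing step. This is only a sketch, and it assumes reference.max() is small enough to allocate a table of that size:

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

# Dense table mapping each value to its position in reference.
# Assumes non-negative values and a small reference.max().
lookup = np.empty(reference.max() + 1, dtype=int)
lookup[reference] = np.arange(reference.size)

indexes = lookup[data]
assert np.all(data == reference[indexes])
```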