Let's say I have an array with a finite number of unique values. Say

data = array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

And I also have a reference array with all the unique values found in data, without repetitions and in a particular order. Say

reference = array([20, 10, 30])

And I want to create an array with the same shape as data containing, as values, the indices in the reference array where each element of data is found.
In other words, having data and reference, I want to create an array indexes such that the following holds:

data = reference[indexes]
A suboptimal approach to compute indexes would be using a for loop, like this:

indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = np.where(data[i] == reference)[0]
but I'd be surprised if there weren't a numpythonic (and thus faster!) way to do this... Any ideas?
Thanks!
We have data and reference as -
In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
In [376]: reference
Out[376]: array([20, 10, 30])
For a moment, let us consider a sorted version of reference -
In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])
Now, we can use np.searchsorted to find the position of each data element in this sorted version, like so -
In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)
If we run the original code, the expected output turns out to be -
In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])
As can be seen, the searchsorted output is almost right, except that every 0 in it should be a 1 and vice versa. That's because the search was done against the sorted version of reference, so the positions it returns refer to the sorted array, not to reference itself. To map them back to positions in the original reference, we bring in the indices used for sorting reference, i.e. np.argsort(reference). That's basically it for a vectorized, loop-free and dict-free approach! So, the final implementation would look something like this -
# Get sorting indices for reference
sort_idx = np.argsort(reference)
# Sort reference and get searchsorted indices for data in reference
pos = np.searchsorted(reference[sort_idx], data, side='left')
# Map positions in the sorted array back to indices in reference
out = sort_idx[pos]
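For a quick sanity check, here is that snippet run end-to-end on the example arrays from the question (a self-contained sketch):

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

sort_idx = np.argsort(reference)  # indices that sort reference: [1, 0, 2]
pos = np.searchsorted(reference[sort_idx], data, side='left')
out = sort_idx[pos]

# The defining property from the question holds:
assert np.array_equal(reference[out], data)
```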
Runtime tests -
In [396]: data = np.random.randint(0,30000,150000)
...: reference = np.unique(data)
...: reference = reference[np.random.permutation(reference.size)]
...:
...:
...: def org_approach(data,reference):
...: indexes = np.zeros_like(data, dtype=int)
...: for i in range(data.size):
...: indexes[i] = np.where(data[i] == reference)[0]
...: return indexes
...:
...: def vect_approach(data,reference):
...: sort_idx = np.argsort(reference)
...: pos = np.searchsorted(reference[sort_idx], data, side='left')
...: return sort_idx[pos]
...:
In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop
In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop
Verify results -
In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True
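As a small aside, np.searchsorted also accepts a sorter argument, which makes it search reference as if it were sorted by sort_idx, so the sorted copy reference[sort_idx] never has to be materialized. A sketch of that variation:

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

sort_idx = np.argsort(reference)
# sorter=sort_idx: binary-search reference in the order given by sort_idx
pos = np.searchsorted(reference, data, sorter=sort_idx, side='left')
indexes = sort_idx[pos]

assert np.array_equal(reference[indexes], data)
```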
You have to loop through the data once to map the data values onto indexes. The quickest way to do that is to look up the value indexes in a dictionary. So you need to create a dictionary from values to indexes first.
Here's a complete example:
import numpy
data = numpy.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = numpy.array([20, 10, 30])
reference_index = dict((value, index) for index, value in enumerate(reference))
indexes = [reference_index[value] for value in data]
assert numpy.all(data == reference[indexes])
This will be faster than the numpy.where approach, because numpy.where does a linear O(n) scan of reference for each element, while the dictionary approach uses a hash table to find each index in O(1) time.
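If the values are known to be small non-negative integers, as in this example, a dense lookup array is another option: it trades the hash table for a single vectorized fancy-indexing step. This is only a sketch, and it assumes reference.max() is small enough to allocate a table of that size:

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

# Dense table mapping each value to its position in reference.
# Assumes non-negative values and a small reference.max().
lookup = np.empty(reference.max() + 1, dtype=int)
lookup[reference] = np.arange(reference.size)

indexes = lookup[data]
assert np.all(data == reference[indexes])
```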