Consider two numpy arrays
a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])
How would I produce a third array
c = np.array([0, 1, 2, 1, 1, 2, 1])
of the same length as a, where each entry is the index in b of the corresponding entry of a?
I can see a way by looping over the elements of b as b[i] and checking np.where(a == b[i]), but I was wondering if numpy could accomplish this in a quicker, cleaner way with fewer lines of code.
Here is one option:
import numpy as np
a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])
my_dict = dict(zip(b, range(len(b))))
result = np.vectorize(my_dict.get)(a)
Result:
>>> result
array([0, 1, 2, 1, 1, 2, 1])
Sorting is a good option for vectorization with numpy:
>>> s = np.argsort(b)
>>> s[np.searchsorted(b, a, sorter=s)]
array([0, 1, 2, 1, 1, 2, 1], dtype=int64)
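The sorter argument matters here because np.searchsorted requires its first array to be sorted, and b in this example is not. A commented sketch of what each step does:

```python
import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])  # note: b is not in sorted order

s = np.argsort(b)  # s = [1, 2, 0]: the permutation that sorts b
# searchsorted finds where each element of a lands in the sorted view of b;
# indexing with s translates those positions back to indices into the original b
c = s[np.searchsorted(b, a, sorter=s)]
print(c.tolist())  # [0, 1, 2, 1, 1, 2, 1]
```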
If your array a has m elements and b has n, the sorting is going to be O(n log n) and the searching O(m log n), which is not bad. Dictionary-based solutions should be amortized linear, but if the arrays are not huge the Python looping may make them slower than this. And broadcasting-based solutions have quadratic complexity; they will only be faster for very small arrays.
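The broadcasting-based approach mentioned above compares every element of a against every element of b, which is where the O(m·n) cost comes from. One common form of it (an argmax variant, shown here as a sketch rather than the timed one-liner below) is:

```python
import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

# a[:, None] == b builds an (m, n) boolean matrix: row i marks which entry
# of b matches a[i].  argmax along axis 1 returns the column of the first
# True in each row, i.e. the index into b -- O(m*n) work and memory.
c = (a[:, None] == b).argmax(axis=1)
print(c.tolist())  # [0, 1, 2, 1, 1, 2, 1]
```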
Some timings with your sample:
In [3]: %%timeit
...: s = np.argsort(b)
...: np.take(s, np.searchsorted(b, a, sorter=s))
...:
100000 loops, best of 3: 4.16 µs per loop
In [5]: %%timeit
...: my_dict = dict(zip(b, range(len(b))))
...: np.vectorize(my_dict.get)(a)
...:
10000 loops, best of 3: 29.9 µs per loop
In [7]: %timeit (np.arange(b.size)*(a==b[:,np.newaxis]).T).sum(axis=-1)
100000 loops, best of 3: 18.5 µs per loop