
Numpy Indexing of 2 Arrays

Consider two numpy arrays

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

How can I produce a third array

c = np.array([0,1,2,1,1,2,1])

with the same length as a, where each entry is the index in b of the corresponding entry of a?

I can see a way to do it by looping over the elements of b as b[i] and checking np.where(a == b[i]), but I was wondering whether numpy can accomplish this in a faster or more concise way.
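For reference, the looping approach described above might look like the following minimal sketch. The -1 placeholder for entries of a that match nothing in b is an assumption added for illustration; the question itself does not say how such entries should be handled.

```python
import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

# Assumption: use -1 to flag entries of a that never match anything in b.
c = np.full(a.shape, -1, dtype=int)
for i in range(len(b)):
    # np.where(a == b[i]) gives the positions in a that hold the i-th name of b.
    c[np.where(a == b[i])] = i

print(c)  # [0 1 2 1 1 2 1]
```

This makes one full pass over a for every element of b, which is why a vectorized alternative is attractive.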

rwolst asked May 12 '14 15:05



2 Answers

Here is one option:

import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

my_dict = dict(zip(b, range(len(b))))

result = np.vectorize(my_dict.get)(a)

Result:

>>> result
array([0, 1, 2, 1, 1, 2, 1])
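
One caveat with the dictionary lookup: my_dict.get returns None for a name that is not in b, which np.vectorize cannot store in an integer result. A possible variation, sketched here under the assumption that -1 is an acceptable marker for missing names (the name 'anna' and the variable lookup are hypothetical, added only for illustration):

```python
import numpy as np

# 'anna' is a hypothetical entry that does not appear in b.
a = np.array(['john', 'bill', 'anna', 'greg'])
b = np.array(['john', 'bill', 'greg'])

my_dict = dict(zip(b, range(len(b))))

# Fall back to -1 for unknown names, and fix the output type to int
# so the result dtype does not depend on the first element of a.
lookup = np.vectorize(lambda name: my_dict.get(name, -1), otypes=[int])
result = lookup(a)

print(result)  # [ 0  1 -1  2]
```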
Akavall answered Sep 28 '22 13:09


Sorting is a good option for vectorization with numpy:

>>> s = np.argsort(b)
>>> s[np.searchsorted(b, a, sorter=s)]
array([0, 1, 2, 1, 1, 2, 1], dtype=int64)

If your array a has m elements and b has n, the sorting is O(n log n) and the searching is O(m log n), which is not bad. Dictionary-based solutions should be amortized linear, O(m + n), but unless the arrays are huge, the Python-level looping may make them slower than this. Broadcasting-based solutions have quadratic complexity, O(m n), so they will only be faster for very small arrays.
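
To make the comparison concrete, here is a rough sketch of the sorting and broadcasting approaches as functions, checked against each other on a larger input. The function names, the extra names in b, and the random test data are all assumptions for illustration. Both functions assume that b has no duplicates and that every entry of a appears in b; otherwise the sorting version can return a wrong or out-of-range index.

```python
import numpy as np

def index_by_sorting(a, b):
    # O(n log n) sort of b, then O(m log n) binary searches for the entries of a.
    s = np.argsort(b)
    return s[np.searchsorted(b, a, sorter=s)]

def index_by_broadcasting(a, b):
    # Builds an (n, m) boolean table, so time and memory grow as O(m * n).
    matches = a == b[:, np.newaxis]
    return (np.arange(b.size) * matches.T).sum(axis=-1)

# Hypothetical data: five unique names, one thousand random draws from them.
b = np.array(['john', 'bill', 'greg', 'anna', 'zoe'])
rng = np.random.default_rng(0)
a = rng.choice(b, size=1000)

idx_sort = index_by_sorting(a, b)
idx_bcast = index_by_broadcasting(a, b)

# Both approaches should agree, and indexing b with the result should rebuild a.
print(np.array_equal(idx_sort, idx_bcast))
print(np.array_equal(b[idx_sort], a))
```

Wrapping each function call in %timeit for different sizes of a and b is one way to see where the crossover between the approaches falls on a given machine.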


Some timings with your sample:

In [3]: %%timeit
   ...: s = np.argsort(b)
   ...: np.take(s, np.searchsorted(b, a, sorter=s))
   ...: 
100000 loops, best of 3: 4.16 µs per loop

In [5]: %%timeit
   ...: my_dict = dict(zip(b, range(len(b))))
   ...: np.vectorize(my_dict.get)(a)
   ...: 
10000 loops, best of 3: 29.9 µs per loop

In [7]: %timeit (np.arange(b.size)*(a==b[:,np.newaxis]).T).sum(axis=-1)
100000 loops, best of 3: 18.5 µs per loop
Jaime answered Sep 28 '22 11:09