So what is a concise and efficient way to convert a numpy array like:
[[0, 0, 1],
[1, 0, 0],
[0, 1, 0]]
into a column like:
[[2],
 [0],
 [1]]
where the number in each row is the column index of the "1" in the corresponding one-hot vector of the original array?
I was thinking of looping through the rows and creating a list of the index value of 1, but I wonder if there is a more efficient way to do it. Thank you for any suggestions.
Update: For a faster solution, see Divakar's answer.
You can use the nonzero() method of the numpy array.  The second element of the tuple that it returns is what you want.  For example,
In [56]: x
Out[56]: 
array([[0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [1, 0, 0, 0]])
In [57]: x.nonzero()[1]
Out[57]: array([2, 2, 3, 3, 0])
According to the docstring of numpy.nonzero(), "the values in a are always tested and returned in row-major, C-style order", so as long as you have exactly one 1 in each row, x.nonzero()[1] will give the positions of the 1 in each row, starting from the first row. (And x.nonzero()[0] will be equal to range(x.shape[0]).)
To get the result as an array with shape (n, 1), you can use the reshape() method
In [59]: x.nonzero()[1].reshape(-1, 1)
Out[59]: 
array([[2],
       [2],
       [3],
       [3],
       [0]])
or you can index with [:, np.newaxis]:
In [60]: x.nonzero()[1][:, np.newaxis]
Out[60]: 
array([[2],
       [2],
       [3],
       [3],
       [0]])
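As a rough sketch of the above, the nonzero() approach can be wrapped in a small helper that also checks the "exactly one 1 per row" assumption. The function name one_hot_to_column is made up for this example and is not part of numpy.

```python
import numpy as np

def one_hot_to_column(x):
    """Return an (n, 1) array holding the column index of the single 1 in each row.

    Hypothetical helper for illustration; assumes x is a 2D array.
    """
    x = np.asarray(x)
    rows, cols = x.nonzero()
    # nonzero() scans in row-major order, so `rows` equals 0..n-1
    # only when every row contains exactly one nonzero entry.
    if not np.array_equal(rows, np.arange(x.shape[0])):
        raise ValueError("each row must contain exactly one nonzero entry")
    return cols.reshape(-1, 1)

x = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
result = one_hot_to_column(x)
print(result)
```

For the array in the question this gives a column with the values 2, 0, 1. An input with an all-zero row or a row with two 1s raises ValueError instead of silently returning misaligned indices.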
We are working with a one-hot encoded array, which guarantees exactly one 1 per row. So the index of the first nonzero entry in each row is the desired result. Since that 1 is also the maximum of its row, and np.argmax returns the index of the first maximum, we can use np.argmax along each row, like so -
a.argmax(axis=1)
If you want a 2D array as output, simply add a singleton dimension at the end -
a.argmax(axis=1)[:,None]
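One caveat worth keeping in mind: argmax returns 0 for a row that is all zeros, so it cannot tell such a row apart from a row whose 1 sits in column 0. A minimal sketch of a guard, assuming the array holds only 0s and 1s; the -1 sentinel and the variable names are arbitrary choices for this example:

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 0, 0],   # not a valid one-hot row
              [0, 1, 0]])

idx = a.argmax(axis=1)                  # the all-zero row also yields 0
valid = a.sum(axis=1) == 1              # True for rows with exactly one 1 (0/1 data assumed)
idx_checked = np.where(valid, idx, -1)  # mark invalid rows with the sentinel -1
column = idx_checked[:, None]           # add the singleton dimension, shape (4, 1)
print(column)
```

If the input is known to be a valid one-hot array, the check is unnecessary and plain a.argmax(axis=1) is enough.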
Runtime test -
In [20]: # Let's create a sample one-hot encoded array
    ...: a = np.zeros((1000,1000),dtype=int)
    ...: idx = np.random.randint(0,1000,1000)
    ...: a[np.arange(1000),idx] = 1
    ...: 
In [21]: %timeit a.nonzero()[1] # @Warren Weckesser's soln
100 loops, best of 3: 9.03 ms per loop
In [22]: %timeit a.argmax(axis=1)
1000 loops, best of 3: 1.15 ms per loop
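To repeat this comparison outside IPython, something along these lines should work with the standard timeit module. It is only a sketch: the seed, the repeat count, and the variable names are arbitrary, and the absolute timings will depend on the machine and the numpy version.

```python
import timeit
import numpy as np

# Build a 1000x1000 one-hot encoded array with a fixed seed for repeatability.
rng = np.random.default_rng(0)
n = 1000
a = np.zeros((n, n), dtype=int)
a[np.arange(n), rng.integers(0, n, size=n)] = 1

# Both approaches should agree on a valid one-hot array.
same = np.array_equal(a.nonzero()[1], a.argmax(axis=1))

# Total time for 20 calls of each approach, in seconds.
t_nonzero = timeit.timeit(lambda: a.nonzero()[1], number=20)
t_argmax = timeit.timeit(lambda: a.argmax(axis=1), number=20)

print("results match:", same)
print("nonzero: %.4f s, argmax: %.4f s" % (t_nonzero, t_argmax))
```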