I have some data stored as a numpy array with dtype=object, and I would like to extract one column of lists and convert it to a numpy array. It seems like a simple problem, but the only way I've found to solve it is to recast the entire thing as a list of lists and then recast it as a numpy array. Is there a more Pythonic approach?
import numpy as np
arr = np.array([[1, ['a', 'b', 'c']], [2, ['a', 'b', 'c']]], dtype=object)
arr = arr[:, 1]
print(arr)
# [['a', 'b', 'c'] ['a', 'b', 'c']]
type(arr)
# numpy.ndarray
type(arr[0])
# list
arr.shape
# (2,)
Recasting the array as dtype=str raises a ValueError, since it is trying to convert each list to a string.
arr.astype(str)
# ValueError: setting an array element with a sequence
It is possible to rebuild the entire array as a list of lists and then cast it as a numpy array, but this seems like a roundabout way.
arr_2 = np.array(list(arr))
type(arr_2)
# numpy.ndarray
type(arr_2[0])
# numpy.ndarray
arr_2.shape
# (2, 3)
Is there a better way to do this?
One way would be to use stacking operations with something like np.vstack -
np.vstack(arr[:, 1])
Sample run -
In [234]: arr
Out[234]:
array([[1, ['a', 'b', 'c']],
       [2, ['a', 'b', 'c']]], dtype=object)
In [235]: arr[:,1]
Out[235]: array([['a', 'b', 'c'], ['a', 'b', 'c']], dtype=object)
In [236]: np.vstack(arr[:, 1])
Out[236]:
array([['a', 'b', 'c'],
       ['a', 'b', 'c']],
      dtype='|S1')
I believe np.vstack would internally use np.concatenate. So, to directly use it, we would have -
np.concatenate(arr[:, 1]).reshape(len(arr),-1)
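As a quick check of the two approaches above, here is a minimal sketch that builds the same array as in the question and compares the results. The variable names (col, stacked, via_concat, via_list) are only illustrative, and the reshape step assumes every list in the column has the same length.

```python
import numpy as np

# Same data as in the question: a 2-D object array whose second column holds lists.
arr = np.array([[1, ['a', 'b', 'c']], [2, ['a', 'b', 'c']]], dtype=object)

# 1-D object array; each element is a Python list.
col = arr[:, 1]

# Stack the lists as rows.
stacked = np.vstack(col)

# Concatenate into one flat array, then restore one row per list.
# Assumption: all lists have equal length, otherwise the reshape fails.
via_concat = np.concatenate(col).reshape(len(col), -1)

# The list-of-lists route from the question, for comparison.
via_list = np.array(col.tolist())

print(stacked.shape)                        # (2, 3)
print(np.array_equal(stacked, via_concat))  # True
print(np.array_equal(stacked, via_list))    # True
```

All three produce a regular 2-D string array rather than an object array.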
Though going by way of lists is faster than by way of vstack:
In [1617]: timeit np.array(arr[:,1].tolist())
...
100000 loops, best of 3: 11.5 µs per loop
In [1618]: timeit np.vstack(arr[:,1])
...
10000 loops, best of 3: 54.1 µs per loop
vstack is doing:
np.concatenate([np.atleast_2d(a) for a in arr[:,1]],axis=0)
Some alternatives:
In [1627]: timeit np.array([a for a in arr[:,1]])
100000 loops, best of 3: 18.6 µs per loop
In [1629]: timeit np.stack(arr[:,1],axis=0)
10000 loops, best of 3: 60.2 µs per loop
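To repeat these comparisons outside IPython, a rough sketch using the standard timeit module might look like the following. The candidates dictionary and the loop counts are my own choices, not part of the timings above, and the absolute numbers will depend on the machine and NumPy version.

```python
import timeit

import numpy as np

arr = np.array([[1, ['a', 'b', 'c']], [2, ['a', 'b', 'c']]], dtype=object)
col = arr[:, 1]

# Hypothetical labels mapped to the conversions being compared.
candidates = {
    "np.array(col.tolist())": lambda: np.array(col.tolist()),
    "np.array([a for a in col])": lambda: np.array([a for a in col]),
    "np.vstack(col)": lambda: np.vstack(col),
    "np.stack(col, axis=0)": lambda: np.stack(col, axis=0),
}

loops = 1000  # calls per measurement; an arbitrary choice
results = {}
for label, func in candidates.items():
    # Take the best of three runs and convert to microseconds per call.
    best = min(timeit.repeat(func, number=loops, repeat=3))
    results[label] = best / loops * 1e6
    print(f"{label}: {results[label]:.1f} µs per loop")
```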
Keep in mind that the object array just contains pointers to the lists, which are elsewhere in memory. While the 2d nature of arr makes it easy to select the 2nd column, arr[:,1] is effectively a list of lists, and most operations on it treat it as such. Things like reshape don't cross that object boundary.
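The object boundary can be illustrated with a short sketch, again using the array from the question. The names col, same_object and reshape_failed are just for illustration.

```python
import numpy as np

arr = np.array([[1, ['a', 'b', 'c']], [2, ['a', 'b', 'c']]], dtype=object)
col = arr[:, 1]

# The column is a 1-D array of two Python objects, not a (2, 3) array.
print(col.shape)   # (2,)
print(col.dtype)   # object

# Indexing returns the stored list itself, not a copy of it.
same_object = col[0] is arr[0, 1]
print(same_object)  # True

# reshape only rearranges the two object slots; it cannot look inside the lists.
print(col.reshape(2, 1).shape)  # (2, 1)

try:
    col.reshape(2, 3)
    reshape_failed = False
except ValueError:
    # Only 2 elements exist at the array level, so (2, 3) is impossible.
    reshape_failed = True
print(reshape_failed)  # True
```

Getting a true (2, 3) array therefore requires building a new array from the list contents, as the conversions above do.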