The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.
    v = [[1], [1, 2]]
    np.array(v)
    >>> array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
    np.array(v, dtype=np.int32)
    ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder:
    array([[1, 0], [1, 2]], dtype=int32)
In NumPy, to replace missing values (NaN, i.e. np.nan) in an ndarray with other numbers, use np.nan_to_num() or a boolean mask built with np.isnan().
How to drop all missing values from a NumPy array? Dropping missing (NaN) values can be done with numpy.isnan(), which returns a boolean mask marking the positions that hold NaN. Inverting that mask with numpy.logical_not() and using the result to index the array keeps only the non-NaN values.
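As a minimal sketch of the two NaN-handling routes described above (the sample array `a` is made up for illustration, and the `nan=` keyword of `np.nan_to_num` assumes NumPy 1.17 or newer):

```python
import numpy as np

# Hypothetical sample data with two missing values.
a = np.array([1.0, np.nan, 3.0, np.nan])

# Replace NaN with a placeholder (0.0 here); `a` itself is left unchanged.
filled = np.nan_to_num(a, nan=0.0)

# The same replacement written with an explicit boolean mask.
filled_mask = np.where(np.isnan(a), 0.0, a)

# Drop NaN entries by indexing with the inverted mask.
dropped = a[np.logical_not(np.isnan(a))]
```

Note that NaN only exists for floating-point dtypes, so this applies to float arrays rather than to the ragged integer lists in the question.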
You can convert a list to a NumPy array by passing it to numpy.array(). The dtype of the resulting numpy.ndarray is determined automatically from the original list, but it can also be specified with the dtype parameter.
Add array element: You can add elements to a NumPy array with numpy.append(). The values are appended at the end of a copy of the array, and a new ndarray containing the old and new values is returned. The optional integer axis selects the axis along which the values are appended; if it is omitted, both inputs are flattened first.
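A short sketch of how `numpy.append` behaves with and without `axis` (the array contents are arbitrary illustration values):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Without axis, both inputs are flattened before appending.
flat = np.append(a, [5, 6])

# With axis=0, the appended values must match a's shape on the other axes.
rows = np.append(a, [[5, 6]], axis=0)

# np.append always returns a new array; `a` keeps its original shape.
```

Because every call copies the whole array, appending in a loop is usually slower than building a list first and converting once.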
You can use itertools.zip_longest:
    import itertools
    np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
    Out:
    array([[1, 0],
           [1, 2]])
Note: For Python 2, it is itertools.izip_longest.
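For reference, a self-contained Python 3 sketch of the same idea that also requests the int32 dtype asked for in the question (passing `dtype=np.int32` is an addition here, not part of the answer's one-liner):

```python
import itertools
import numpy as np

v = [[1], [1, 2]]

# zip_longest pads the shorter rows with fillvalue while transposing,
# so the final .T restores the original row/column orientation.
padded = np.array(
    list(itertools.zip_longest(*v, fillvalue=0)), dtype=np.int32
).T
```

The result of `.T` is a transposed view; if a C-contiguous array is needed downstream, wrapping it in `np.ascontiguousarray` is one option.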
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
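    import numpy as np

    def boolean_indexing(v):
        # Length of each inner list (the only Python-level loop).
        lens = np.array([len(item) for item in v])
        # mask[i, j] is True where row i has a real value at column j.
        mask = lens[:,None] > np.arange(lens.max())
        out = np.zeros(mask.shape,dtype=int)
        # Fill the valid positions in row-major order.
        out[mask] = np.concatenate(v)
        return out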
Sample run
    In [27]: v
    Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

    In [28]: out
    Out[28]:
    array([[1, 0, 0, 0, 0],
           [1, 2, 0, 0, 0],
           [3, 6, 7, 8, 9],
           [4, 0, 0, 0, 0]])
*Please note that this is called almost vectorized because the only looping performed is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.
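The boolean-indexing approach can be adapted to the int32 output and configurable placeholder asked for in the question. The following is a sketch only: the function name `pad_ragged` and its `fill` and `dtype` parameters are hypothetical, and it assumes `v` contains at least one list.

```python
import numpy as np

def pad_ragged(v, fill=0, dtype=np.int32):
    """Pad a list of variable-length lists into a dense 2D array."""
    # Length of each inner list (the only Python-level loop).
    lens = np.array([len(item) for item in v])
    # mask[i, j] is True where row i actually has a j-th element.
    mask = lens[:, None] > np.arange(lens.max())
    # Start from an array full of the placeholder value.
    out = np.full(mask.shape, fill, dtype=dtype)
    # Boolean assignment fills True positions in row-major order,
    # which matches the order of the concatenated input lists.
    out[mask] = np.concatenate(v)
    return out

v = [[1], [1, 2], [3, 6, 7]]
padded = pad_ragged(v, fill=-1)
# padded is [[1, -1, -1], [1, 2, -1], [3, 6, 7]] with dtype int32
```

Using `np.full` instead of `np.zeros` is what allows a placeholder other than 0.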
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, along with the boolean-indexing based one from this post, on relatively larger datasets with three levels of size variation across the list elements.
Case #1 : Larger size variation
    In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

    In [45]: v = v*1000

    In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 9.82 ms per loop

    In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    100 loops, best of 3: 5.11 ms per loop

    In [48]: %timeit boolean_indexing(v)
    100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
    In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

    In [50]: v = v*1000

    In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 3.12 ms per loop

    In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1000 loops, best of 3: 1.55 ms per loop

    In [53]: %timeit boolean_indexing(v)
    100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
    In [139]: # Setup inputs
         ...: N = 10000 # Number of elems in list
         ...: maxn = 100 # Max. size of a list element
         ...: lens = np.random.randint(0,maxn,(N))
         ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
         ...:

    In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    1 loops, best of 3: 292 ms per loop

    In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1 loops, best of 3: 264 ms per loop

    In [142]: %timeit boolean_indexing(v)
    10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! Still, there's no clear winner, so the choice would have to be made on a case-by-case basis.
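Since the timings above come from Python 2 (`izip_longest`) and will differ by machine and library version, here is a hedged sketch for re-running a comparable benchmark on Python 3 with the standard `timeit` module. The wrapper name `zip_longest_pad` is hypothetical, the dataset mirrors Case #2, and the pandas variant is left out to keep the snippet dependent on NumPy only.

```python
import itertools
import timeit
import numpy as np

def zip_longest_pad(v):
    # itertools-based approach, Python 3 spelling.
    return np.array(list(itertools.zip_longest(*v, fillvalue=0))).T

def boolean_indexing(v):
    # Boolean-indexing approach from this post.
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out

# Same shape of data as Case #2 above.
v = [[1], [1, 2, 4, 8, 4], [6, 7, 3, 6, 7, 8]] * 1000

# Sanity check: both approaches should produce identical arrays.
assert np.array_equal(zip_longest_pad(v), boolean_indexing(v))

for fn in (zip_longest_pad, boolean_indexing):
    # Best of 3 repeats, 10 calls each; reported per call in ms.
    best = min(timeit.repeat(lambda: fn(v), number=10, repeat=3))
    print(f"{fn.__name__}: {best / 10 * 1e3:.2f} ms per call")
```

No expected numbers are given here on purpose; the relative ordering may not match the results reported above.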