
Convert Python sequence to NumPy array, filling missing values

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.

    v = [[1], [1, 2]]
    np.array(v)
    >>> array([[1], [1, 2]], dtype=object)

Trying to force another type will cause an exception:

    np.array(v, dtype=np.int32)
    ValueError: setting an array element with a sequence.
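For reference, here is a minimal sketch of the two behaviours described above. It passes dtype=object explicitly, on the assumption that a recent NumPy is in use: as far as I know, NumPy 1.24 and later no longer build the object array implicitly and raise a ValueError for ragged input instead.

```python
import numpy as np

v = [[1], [1, 2]]

# Requesting dtype=object explicitly keeps the ragged rows as Python lists.
ragged = np.array(v, dtype=object)
print(ragged.dtype, ragged.shape)

# Requesting a numeric dtype fails because the rows have different lengths.
try:
    np.array(v, dtype=np.int32)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The object array is one-dimensional with one Python list per element, which is why it cannot be used as a dense integer matrix.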

What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?

From my sample sequence v, I would like to get something like this, if 0 is the placeholder:

array([[1, 0], [1, 2]], dtype=int32) 
asked Jul 27 '16 17:07 by Marco Ancona




2 Answers

You can use itertools.zip_longest:

    import itertools
    np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
    Out:
    array([[1, 0],
           [1, 2]])

Note: For Python 2, it is itertools.izip_longest.
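As a self-contained Python 3 sketch of the same idea, the call can be wrapped in a small helper that also takes the placeholder and the target dtype. The function name pad_with_zip_longest and its parameters are my own, chosen for illustration; they are not part of itertools or NumPy.

```python
import itertools

import numpy as np


def pad_with_zip_longest(seqs, fill=0, dtype=np.int32):
    """Hypothetical helper: pad ragged rows with `fill` via zip_longest."""
    # zip_longest walks the input column-wise, so each tuple is one padded column.
    columns = list(itertools.zip_longest(*seqs, fillvalue=fill))
    # Transpose so that every input sequence becomes one row again.
    return np.array(columns, dtype=dtype).T


v = [[1], [1, 2]]
padded = pad_with_zip_longest(v)
print(padded)
```

Passing dtype to np.array directly avoids a separate astype call when int32 is required, as in the question.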

answered Sep 28 '22 04:09 by ayhan


Here's an almost* vectorized approach based on boolean indexing, which I have used in several other posts:

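    import numpy as np

    def boolean_indexing(v):
        # Length of each inner list
        lens = np.array([len(item) for item in v])
        # mask[i, j] is True where row i has a j-th element
        mask = lens[:, None] > np.arange(lens.max())
        # Start from an all-zero array, so 0 acts as the placeholder
        out = np.zeros(mask.shape, dtype=int)
        # Write the flattened values into the valid positions, row by row
        out[mask] = np.concatenate(v)
        return out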

Sample run

    In [27]: v
    Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

    In [28]: out
    Out[28]:
    array([[1, 0, 0, 0, 0],
           [1, 2, 0, 0, 0],
           [3, 6, 7, 8, 9],
           [4, 0, 0, 0, 0]])

*Please note that this is described as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.
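To make the masking step concrete, here is a short sketch that prints the intermediate mask for the sample input and then fills with a placeholder other than 0. Using np.full with a fill variable is my own variation for illustration, not part of the function above.

```python
import numpy as np

v = [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

lens = np.array([len(item) for item in v])
# Broadcasting compares each row length (as a column) with every column index.
mask = lens[:, None] > np.arange(lens.max())
print(mask.astype(int))

# Assumed variation: np.full allows a placeholder other than 0.
fill = -1
out = np.full(mask.shape, fill, dtype=np.int32)
# Boolean assignment fills the True positions in row-major order,
# which matches the order of the concatenated values.
out[mask] = np.concatenate(v)
print(out)
```

Each row of the mask has as many leading True entries as the corresponding list has elements, so the remaining positions keep the placeholder.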

Runtime test

In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, along with the boolean-indexing based one from this post, on a relatively large dataset with three levels of size variation across the list elements.

Case #1 : Larger size variation

    In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

    In [45]: v = v*1000

    In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 9.82 ms per loop

    In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    100 loops, best of 3: 5.11 ms per loop

    In [48]: %timeit boolean_indexing(v)
    100 loops, best of 3: 6.88 ms per loop

Case #2 : Lesser size variation

    In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

    In [50]: v = v*1000

    In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 3.12 ms per loop

    In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1000 loops, best of 3: 1.55 ms per loop

    In [53]: %timeit boolean_indexing(v)
    100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

    In [139]: # Setup inputs
         ...: N = 10000 # Number of elems in list
         ...: maxn = 100 # Max. size of a list element
         ...: lens = np.random.randint(0,maxn,(N))
         ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
         ...:

    In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    1 loops, best of 3: 292 ms per loop

    In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1 loops, best of 3: 264 ms per loop

    In [142]: %timeit boolean_indexing(v)
    10 loops, best of 3: 95.7 ms per loop

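To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner, though: it was the fastest in cases #1 and #2, while boolean indexing was the fastest in case #3, so the choice would have to be made on a case-by-case basis.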
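The timings above were taken on Python 2 with IPython's %timeit. To repeat the comparison on Python 3 outside IPython, something like the following sketch could be used. It relies on the standard timeit module and itertools.zip_longest, and the wrapper name zip_longest_pad is hypothetical. It prints no expected numbers, since results depend on the machine and the library versions.

```python
import itertools
import timeit

import numpy as np


def zip_longest_pad(v):
    # Hypothetical wrapper around the itertools-based answer (Python 3 name).
    return np.array(list(itertools.zip_longest(*v, fillvalue=0))).T


def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out


# Same input as Case #1 above.
v = [[1], [1, 2, 4, 8, 4],
     [6, 7, 3, 6, 7, 8, 9, 3, 6, 4, 8, 3, 2, 4, 5, 6, 6, 8, 7, 9, 3, 6, 4]] * 1000

# Both functions should agree before their timings are compared.
results_match = np.array_equal(zip_longest_pad(v), boolean_indexing(v))

for func in (zip_longest_pad, boolean_indexing):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(v), number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```

Checking that both outputs match first guards against timing two functions that do not compute the same thing.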

answered Sep 28 '22 03:09 by Divakar