Why does numpy's fromiter function require specifying the dtype when other array creation routines don't?

To improve memory efficiency, I've been converting some of my code from lists to generators/iterators where I can. I've found many cases where I'm simply converting a list I've built into an np.array with the pattern np.array(some_list).

Notably, some_list is often a list comprehension that is iterating over a generator.

I was looking into np.fromiter to see if I could use the generator more directly (rather than first collecting it into a list and then converting that list into a NumPy array), but I noticed that np.fromiter, unlike every other array creation routine that uses existing data, requires specifying the dtype.

In most of my cases I can make that work (I'm mostly dealing with log-likelihoods, so float64 will be fine), but it left me wondering why this is necessary only for the fromiter array creator and not for the others.
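For concreteness, here is a minimal sketch of the two patterns I'm comparing (loglik and the generator are hypothetical placeholders, not code from my project):

import numpy as np

def loglik(x):
    # stand-in for a per-element log-likelihood computation
    return np.log(x)

values = (i + 1 for i in range(1000))   # some generator of inputs

# current pattern: materialize a list, then convert it
arr_from_list = np.array([loglik(v) for v in values])

# hoped-for pattern: feed a generator expression to fromiter directly,
# which requires committing to a dtype up front
values = (i + 1 for i in range(1000))   # generators are single-use; recreate
arr_from_iter = np.fromiter((loglik(v) for v in values), dtype=np.float64)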

First attempts at a guess:

Memory preallocation?

My understanding is that if you know the dtype and the count, memory can be preallocated for the resulting np.array; if you don't specify the optional count argument, it will "resize the output array on demand". But if you don't specify the count, it would seem you should be able to infer the dtype on the fly, in the same way that a normal np.array call does.
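For example (a sketch; the sizes are arbitrary), when the length of the iterable is known, count lets fromiter allocate the whole buffer once:

import numpy as np

n = 10**6
gen = (i * 0.5 for i in range(n))
# with count, the full n-element buffer can be allocated up front
arr = np.fromiter(gen, dtype=np.float64, count=n)

gen = (i * 0.5 for i in range(n))
# without count, the output buffer is grown on demand and shrunk to fit at the end
arr = np.fromiter(gen, dtype=np.float64)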

Datatype recasting?

I could see this being useful for recasting data into new dtypes, but that would hold for other array creation routines as well, and would seem to merit placement as an optional but not required argument.

A couple ways of restating the question

So why do you need to specify the dtype to use np.fromiter? Or, put another way, what is gained by specifying the dtype if the array is going to be resized on demand anyway?

A more subtle version of the same question, more directly related to my problem: I know that many of the efficiency gains of np.ndarrays are lost when you're constantly resizing them, so what is gained by using np.fromiter(generator, dtype=d) over np.fromiter([gen_elem for gen_elem in generator], dtype=d), or over np.array([gen_elem for gen_elem in generator], dtype=d)?

asked Dec 01 '15 by mpacer


1 Answer

If this code was written a decade ago, and there hasn't been pressure to change it, then the old reasons still apply. Most people are happy using np.array. np.fromiter is mainly used by people who are trying to squeeze some speed out of iterative methods of generating values.

My impression is that np.array, the main alternative, reads/processes the whole input before deciding on the dtype (and other properties). For example, I can force a float return just by changing one element:

In [395]: np.array([0,1,2,3,4,5])
Out[395]: array([0, 1, 2, 3, 4, 5])
In [396]: np.array([0,1,2,3,4,5,6.])
Out[396]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.])

I don't use fromiter much, but my sense is that by requiring dtype, it can start converting the inputs to that type right from the start. That could end up producing a faster iteration, though that needs time tests.
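A small illustration of that up-front conversion (my own example, not part of the original answer): because each element is cast to the requested dtype as it is read, a stray float is truncated rather than promoting the whole result:

import numpy as np

np.array([0, 1, 2, 3.5])                 # -> array([0. , 1. , 2. , 3.5]); whole array promoted to float
np.fromiter([0, 1, 2, 3.5], dtype=int)   # -> array([0, 1, 2, 3]); 3.5 truncated to fit the dtype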

I know that the np.array generality comes at a certain time cost. Often for small lists it is faster to use a list comprehension than to convert it to an array - even though array operations are fast.

Some time tests:

In [404]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=int)
100000 loops, best of 3: 3.35 µs per loop
In [405]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=float)
100000 loops, best of 3: 3.88 µs per loop
In [406]: timeit np.array([0,1,2,3,4,5,6.])
100000 loops, best of 3: 4.51 µs per loop
In [407]: timeit np.array([0,1,2,3,4,5,6])
100000 loops, best of 3: 3.93 µs per loop

The differences are small, but they suggest my reasoning is correct. Requiring dtype helps keep fromiter fast. Specifying count makes no difference at this small size.

Curiously, specifying a dtype for np.array slows it down. It's as though it appends an astype call:

In [416]: timeit np.array([0,1,2,3,4,5,6],dtype=float)
100000 loops, best of 3: 6.52 µs per loop
In [417]: timeit np.array([0,1,2,3,4,5,6]).astype(float)
100000 loops, best of 3: 6.21 µs per loop

The differences between np.array and np.fromiter are more dramatic when I use range(1000) (which in Python 3 is a lazy iterable rather than a list):

In [430]: timeit np.array(range(1000))
1000 loops, best of 3: 704 µs per loop

Actually, turning the range into a list is faster:

In [431]: timeit np.array(list(range(1000)))
1000 loops, best of 3: 196 µs per loop

but fromiter is still faster:

In [432]: timeit np.fromiter(range(1000),dtype=int)
10000 loops, best of 3: 87.6 µs per loop

It is faster to apply the int-to-float conversion to the whole array than to each element during the generation/iteration:

In [434]: timeit np.fromiter(range(1000),dtype=int).astype(float)
10000 loops, best of 3: 106 µs per loop
In [435]: timeit np.fromiter(range(1000),dtype=float)
1000 loops, best of 3: 189 µs per loop

Note that the astype conversion is not that expensive, only some 20 µs.

============================

array_fromiter(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds) is defined in:

https://github.com/numpy/numpy/blob/eeba2cbfa4c56447e36aad6d97e323ecfbdade56/numpy/core/src/multiarray/multiarraymodule.c

It processes the keywds and calls PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count) in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/core/src/multiarray/ctors.c

This makes an initial array ret using the defined dtype:

ret = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type, dtype, 1,
                                            &elcount, NULL,NULL, 0, NULL);

The data attribute of this array is grown with 50% overallocation => 0, 4, 8, 14, 23, 36, 56, 86 ..., and shrunk to fit at the end.
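A short Python sketch of that growth rule (my reading of the C source; the constants come from PyArray_FromIter and reproduce the sequence above):

def grow(elcount):
    # mirrors the C expression: elcount = (elcount >> 1) + (elcount < 4 ? 4 : 2) + elcount
    return (elcount >> 1) + (4 if elcount < 4 else 2) + elcount

sizes = [0]
while sizes[-1] < 100:
    sizes.append(grow(sizes[-1]))
print(sizes)   # [0, 4, 8, 14, 23, 36, 56, 86, 131]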

The dtype of this array, PyArray_DESCR(ret), apparently has a function that can take value (provided by the iterator next), convert it, and set it in the data.

`PyArray_DESCR(ret)->f->setitem(value, item, ret)`

In other words, all the dtype conversion is done by the defined dtype. The code would be a lot more complicated if it had to decide 'on the fly' how to convert the value (and recast all the previously stored ones). Most of the code in this function deals with allocating the data buffer.

I'll hold off on looking up np.array. I'm sure it is much more complex.

answered Nov 14 '22 by hpaulj