To improve memory efficiency, I've been working on converting some of my code from lists to generators/iterators where I can. I've found many cases where I am just converting a list I've made to an `np.array` with the pattern `np.array(some_list)`. Notably, `some_list` is often a list comprehension that iterates over a generator.
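For concreteness, the pattern looks roughly like this (the generator here is a placeholder for my real code):

import numpy as np

def loglikelihoods():
    # placeholder generator; the real one does per-observation work
    for x in range(1000):
        yield -0.5 * x

# current pattern: materialize a full list, then convert it to an array
arr = np.array([ll for ll in loglikelihoods()])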
I was looking into `np.fromiter` to see if I could use the generator more directly (rather than having to first cast it into a list and then convert that into a numpy array), but I noticed that `np.fromiter`, unlike every other array creation routine that uses existing data, requires specifying the `dtype`. In most of my particular cases I can make that work (I'm mostly dealing with loglikelihoods, so float64 will be fine), but it left me wondering why this is necessary only for the `fromiter` array creator and not for the others.
What I understand is that if you know the `dtype` and the `count`, it allows preallocating memory for the resulting `np.array`, and that if you don't specify the optional `count` argument it will "resize the output array on demand". But if you do not specify the count, it would seem that you should be able to infer the `dtype` on the fly in the same way that you can in a normal `np.array` call.
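In other words, I would have expected the fully specified call to be the only case where preallocation pays off. A sketch of what I mean, with a hypothetical generator and a known element count `n`:

import numpy as np

n = 1000
gen = (-0.5 * x for x in range(n))   # hypothetical stand-in for my real generator

# with both dtype and count, fromiter can allocate the whole buffer up front
arr = np.fromiter(gen, dtype=np.float64, count=n)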
I could see this being useful for recasting data into new `dtype`s, but that would hold for other array creation routines as well, and would seem to merit placement as an optional but not required argument. So why is it that you need to specify the `dtype` to use `np.fromiter`? Or, put another way, what are the gains from specifying the `dtype` if the array is going to be resized on demand anyway?
A more subtle version of the same question that is more directly related to my problem: I know many of the efficiency gains of `np.ndarray`s are lost when you're constantly resizing them, so what is gained from using `np.fromiter(generator, dtype=d)` over `np.fromiter([gen_elem for gen_elem in generator], dtype=d)` over `np.array([gen_elem for gen_elem in generator], dtype=d)`?
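(For reference, a quick `timeit` harness to compare the three variants; the generator is a throwaway stand-in for my real data:)

import timeit

setup = "import numpy as np; gen = lambda: (float(x) for x in range(10000))"
for stmt in (
    "np.fromiter(gen(), dtype=float)",
    "np.fromiter([e for e in gen()], dtype=float)",
    "np.array([e for e in gen()], dtype=float)",
):
    print(stmt, timeit.timeit(stmt, setup=setup, number=200))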
If this code was written a decade ago, and there hasn't been pressure to change it, then the old reasons still apply. Most people are happy using `np.array`. `np.fromiter` is mainly used by people who are trying to squeeze out some speed from iterative methods of generating values.
My impression is that `np.array`, the main alternative, reads/processes the whole input before deciding on the dtype (and other properties):
I can force a float return just by changing one element:
In [395]: np.array([0,1,2,3,4,5])
Out[395]: array([0, 1, 2, 3, 4, 5])
In [396]: np.array([0,1,2,3,4,5,6.])
Out[396]: array([ 0., 1., 2., 3., 4., 5., 6.])
I don't use `fromiter` much, but my sense is that by requiring `dtype`, it can start converting the inputs to that type right from the start. That could end up producing a faster iteration, though that needs time tests.
I know that the `np.array` generality comes at a certain time cost. Often, for small lists, it is faster to use a list comprehension than to convert it to an array, even though array operations are fast.
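(A hedged illustration of that tradeoff; the sizes and the exact gap will vary by machine:)

import timeit

setup = "import numpy as np; small = list(range(20))"
# pure list comprehension vs paying the array-conversion cost first
print(timeit.timeit("[x * 2 for x in small]", setup=setup, number=100000))
print(timeit.timeit("np.array(small) * 2", setup=setup, number=100000))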
Some time tests:
In [404]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=int)
100000 loops, best of 3: 3.35 µs per loop
In [405]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=float)
100000 loops, best of 3: 3.88 µs per loop
In [406]: timeit np.array([0,1,2,3,4,5,6.])
100000 loops, best of 3: 4.51 µs per loop
In [407]: timeit np.array([0,1,2,3,4,5,6])
100000 loops, best of 3: 3.93 µs per loop
The differences are small, but they suggest my reasoning is correct. Requiring `dtype` helps keep `fromiter` fast. `count` does not make a difference at this small size.
Curiously, specifying a `dtype` for `np.array` slows it down. It's as though it appends an `astype` call:
In [416]: timeit np.array([0,1,2,3,4,5,6],dtype=float)
100000 loops, best of 3: 6.52 µs per loop
In [417]: timeit np.array([0,1,2,3,4,5,6]).astype(float)
100000 loops, best of 3: 6.21 µs per loop
The differences between `np.array` and `np.fromiter` are more dramatic when I use `range(1000)` (which in Python 3 is a lazy iterable rather than a list):
In [430]: timeit np.array(range(1000))
1000 loops, best of 3: 704 µs per loop
Actually, turning the range into a list is faster:
In [431]: timeit np.array(list(range(1000)))
1000 loops, best of 3: 196 µs per loop
but `fromiter` is still faster:
In [432]: timeit np.fromiter(range(1000),dtype=int)
10000 loops, best of 3: 87.6 µs per loop
It is faster to apply the `int` to `float` conversion on the whole array than to each element during the generation/iteration:
In [434]: timeit np.fromiter(range(1000),dtype=int).astype(float)
10000 loops, best of 3: 106 µs per loop
In [435]: timeit np.fromiter(range(1000),dtype=float)
1000 loops, best of 3: 189 µs per loop
Note that the `astype` conversion is not that expensive, only some 20 µs (106 µs vs 87.6 µs).
============================
array_fromiter(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds) is defined in:
https://github.com/numpy/numpy/blob/eeba2cbfa4c56447e36aad6d97e323ecfbdade56/numpy/core/src/multiarray/multiarraymodule.c
It processes the `keywds` and calls PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count) in:
https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/core/src/multiarray/ctors.c
This makes an initial array `ret` using the defined `dtype`:

ret = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type, dtype, 1,
                                            &elcount, NULL, NULL, 0, NULL);
The `data` attribute of this array is grown with 50% overallocation (=> 0, 4, 8, 14, 23, 36, 56, 86 ...), and shrunk to fit at the end.
The dtype of this array, `PyArray_DESCR(ret)`, apparently has a function that can take a `value` (provided by the iterator's `next`), convert it, and set it in the `data`:

PyArray_DESCR(ret)->f->setitem(value, item, ret)
In other words, all the dtype conversion is done by the defined dtype. The code would be a lot more complicated if it decided 'on the fly' how to convert the `value` (and all previously allocated ones). Most of the code in this function deals with allocating the `data` buffer.
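For intuition, here's a rough Python mock-up of that allocation logic (my sketch, not the actual C implementation; the growth formula reproduces the sequence quoted above):

import numpy as np

def fromiter_sketch(iterable, dtype, count=-1):
    # rough Python mock-up of PyArray_FromIter's buffer handling
    if count >= 0:
        # dtype and count known: allocate the final buffer once, up front
        out = np.empty(count, dtype=dtype)
        for i, value in zip(range(count), iterable):
            out[i] = value                # the dtype's setitem converts each value
        return out
    # count unknown: grow with 50% overallocation => 0, 4, 8, 14, 23, 36, 56, 86 ...
    capacity, n = 0, 0
    out = np.empty(capacity, dtype=dtype)
    for value in iterable:
        if n >= capacity:
            capacity = (capacity >> 1) + (4 if capacity < 4 else 2) + capacity
            grown = np.empty(capacity, dtype=dtype)
            grown[:n] = out[:n]           # reallocate and copy, like realloc
            out = grown
        out[n] = value
        n += 1
    return out[:n].copy()                 # shrink to fit at the end

# e.g. fromiter_sketch((0.5 * x for x in range(10)), dtype=np.float64)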
I'll hold off on looking up `np.array`. I'm sure it is much more complex.