To improve memory efficiency, I've been working on converting some of my code from lists to generators/iterators where I can. I've found many cases where I am just converting a list I've made to an `np.array` with the pattern `np.array(some_list)`. Notably, `some_list` is often a list comprehension that iterates over a generator.
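For concreteness, the pattern looks roughly like this (the generator here is a placeholder for my real code):

import numpy as np

def loglikelihoods():
    # placeholder generator; the real one does per-observation work
    for x in range(1000):
        yield -0.5 * x

# current pattern: materialize a full list, then convert it to an array
arr = np.array([ll for ll in loglikelihoods()])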
I was looking into `np.fromiter` to see if I could use the generator more directly (rather than having to first cast it into a list and then convert that into a numpy array), but I noticed that `np.fromiter`, unlike every other array creation routine that uses existing data, requires specifying the `dtype`. In most of my particular cases I can make that work (I'm mostly dealing with loglikelihoods, so float64 will be fine), but it left me wondering why this is necessary only for the `fromiter` array creator and not for the others.
What I understand is that if you know the `dtype` and the `count`, it allows preallocating memory for the resulting `np.array`, and that if you don't specify the optional `count` argument it will "resize the output array on demand". But if you do not specify the count, it would seem that you should be able to infer the `dtype` on the fly in the same way that you can in a normal `np.array` call.
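In other words, I would have expected the fully specified call to be the only case where preallocation pays off. A sketch of what I mean, with a hypothetical generator and a known element count `n`:

import numpy as np

n = 1000
gen = (-0.5 * x for x in range(n))   # hypothetical stand-in for my real generator

# with both dtype and count, fromiter can allocate the whole buffer up front
arr = np.fromiter(gen, dtype=np.float64, count=n)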
I could see this being useful for recasting data into new `dtype`s, but that would hold for other array creation routines as well, and would seem to merit placement as an optional but not required argument. So why is it that you need to specify the `dtype` to use `np.fromiter`? Or, put another way, what are the gains from specifying the `dtype` if the array is going to be resized on demand anyway?
A more subtle version of the same question that is more directly related to my problem: I know many of the efficiency gains of `np.ndarray`s are lost when you're constantly resizing them, so what is gained from using `np.fromiter(generator, dtype=d)` over `np.fromiter([gen_elem for gen_elem in generator], dtype=d)` over `np.array([gen_elem for gen_elem in generator], dtype=d)`?
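(For reference, a quick `timeit` harness to compare the three variants; the generator is a throwaway stand-in for my real data:)

import timeit

setup = "import numpy as np; gen = lambda: (float(x) for x in range(10000))"
for stmt in (
    "np.fromiter(gen(), dtype=float)",
    "np.fromiter([e for e in gen()], dtype=float)",
    "np.array([e for e in gen()], dtype=float)",
):
    print(stmt, timeit.timeit(stmt, setup=setup, number=200))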
If this code was written a decade ago, and there hasn't been pressure to change it, then the old reasons still apply. Most people are happy using `np.array`. `np.fromiter` is mainly used by people who are trying to squeeze out some speed from iterative methods of generating values.
My impression is that `np.array`, the main alternative, reads/processes the whole input before deciding on the dtype (and other properties):
I can force a float return just by changing one element:
In [395]: np.array([0,1,2,3,4,5])
Out[395]: array([0, 1, 2, 3, 4, 5])
In [396]: np.array([0,1,2,3,4,5,6.])
Out[396]: array([ 0., 1., 2., 3., 4., 5., 6.])
I don't use `fromiter` much, but my sense is that by requiring `dtype`, it can start converting the inputs to that type right from the start. That could end up producing a faster iteration, though that needs time tests.
I know that the `np.array` generality comes at a certain time cost. Often, for small lists, it is faster to use a list comprehension than to convert it to an array, even though array operations are fast.
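(A hedged illustration of that tradeoff; the sizes and the exact gap will vary by machine:)

import timeit

setup = "import numpy as np; small = list(range(20))"
# pure list comprehension vs paying the array-conversion cost first
print(timeit.timeit("[x * 2 for x in small]", setup=setup, number=100000))
print(timeit.timeit("np.array(small) * 2", setup=setup, number=100000))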
Some time tests:
In [404]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=int)
100000 loops, best of 3: 3.35 µs per loop
In [405]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=float)
100000 loops, best of 3: 3.88 µs per loop
In [406]: timeit np.array([0,1,2,3,4,5,6.])
100000 loops, best of 3: 4.51 µs per loop
In [407]: timeit np.array([0,1,2,3,4,5,6])
100000 loops, best of 3: 3.93 µs per loop
The differences are small, but they suggest my reasoning is correct. Requiring `dtype` helps keep `fromiter` fast. `count` does not make a difference at this small size.
Curiously, specifying a `dtype` for `np.array` slows it down. It's as though it appends an `astype` call:
In [416]: timeit np.array([0,1,2,3,4,5,6],dtype=float)
100000 loops, best of 3: 6.52 µs per loop
In [417]: timeit np.array([0,1,2,3,4,5,6]).astype(float)
100000 loops, best of 3: 6.21 µs per loop
The differences between `np.array` and `np.fromiter` are more dramatic when I use `range(1000)` (which in Python 3 is a lazy iterable rather than a list):
In [430]: timeit np.array(range(1000))
1000 loops, best of 3: 704 µs per loop
Actually, turning the range into a list is faster:
In [431]: timeit np.array(list(range(1000)))
1000 loops, best of 3: 196 µs per loop
but `fromiter` is still faster:
In [432]: timeit np.fromiter(range(1000),dtype=int)
10000 loops, best of 3: 87.6 µs per loop
It is faster to apply the `int` to `float` conversion on the whole array than to each element during the generation/iteration:
In [434]: timeit np.fromiter(range(1000),dtype=int).astype(float)
10000 loops, best of 3: 106 µs per loop
In [435]: timeit np.fromiter(range(1000),dtype=float)
1000 loops, best of 3: 189 µs per loop
Note that the `astype` conversion is not that expensive, only some 20 µs (106 µs vs 87.6 µs).
============================
array_fromiter(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds) is defined in:
https://github.com/numpy/numpy/blob/eeba2cbfa4c56447e36aad6d97e323ecfbdade56/numpy/core/src/multiarray/multiarraymodule.c
It processes the `keywds` and calls PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count) in:
https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/core/src/multiarray/ctors.c
This makes an initial array `ret` using the defined `dtype`:

ret = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type, dtype, 1,
                                            &elcount, NULL, NULL, 0, NULL);
The `data` attribute of this array is grown with 50% overallocation (=> 0, 4, 8, 14, 23, 36, 56, 86 ...), and shrunk to fit at the end.
The dtype of this array, `PyArray_DESCR(ret)`, apparently has a function that can take a `value` (provided by the iterator's `next`), convert it, and set it in the `data`:

PyArray_DESCR(ret)->f->setitem(value, item, ret)
In other words, all the dtype conversion is done by the defined dtype. The code would be a lot more complicated if it decided 'on the fly' how to convert the `value` (and all previously allocated ones). Most of the code in this function deals with allocating the `data` buffer.
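For intuition, here's a rough Python mock-up of that allocation logic (my sketch, not the actual C implementation; the growth formula reproduces the sequence quoted above):

import numpy as np

def fromiter_sketch(iterable, dtype, count=-1):
    # rough Python mock-up of PyArray_FromIter's buffer handling
    if count >= 0:
        # dtype and count known: allocate the final buffer once, up front
        out = np.empty(count, dtype=dtype)
        for i, value in zip(range(count), iterable):
            out[i] = value                # the dtype's setitem converts each value
        return out
    # count unknown: grow with 50% overallocation => 0, 4, 8, 14, 23, 36, 56, 86 ...
    capacity, n = 0, 0
    out = np.empty(capacity, dtype=dtype)
    for value in iterable:
        if n >= capacity:
            capacity = (capacity >> 1) + (4 if capacity < 4 else 2) + capacity
            grown = np.empty(capacity, dtype=dtype)
            grown[:n] = out[:n]           # reallocate and copy, like realloc
            out = grown
        out[n] = value
        n += 1
    return out[:n].copy()                 # shrink to fit at the end

# e.g. fromiter_sketch((0.5 * x for x in range(10)), dtype=np.float64)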
I'll hold off on looking up `np.array`. I'm sure it is much more complex.