Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what is the quickest way to iterate through a numpy array

I noticed a meaningful difference between iterating through a numpy array "directly" versus iterating through via the tolist method. See timing below:

directly
[i for i in np.arange(10000000)]
via tolist
[i for i in np.arange(10000000).tolist()]

enter image description here


considering I've discovered one way to go faster. I wanted to ask what else might make it go faster?

what is fastest way to iterate through a numpy array?

like image 214
piRSquared Avatar asked Nov 14 '16 16:11

piRSquared


People also ask

How can I make NumPy array faster?

By explicitly declaring the "ndarray" data type, your array processing can be 1250x faster. This tutorial will show you how to speed up the processing of NumPy arrays using Cython. By explicitly specifying the data types of variables in Python, Cython can give drastic speed increases at runtime.

Can I iterate through NumPy array?

Iterating Arrays Iterating means going through elements one by one. As we deal with multi-dimensional arrays in numpy, we can do this using basic for loop of python. If we iterate on a 1-D array it will go through each element one by one.

What is faster than NumPy?

pandas provides a bunch of C or Cython optimized functions that can be faster than the NumPy equivalent function (e.g. reading text from text files). If you want to do mathematical operations like a dot product, calculating mean, and some more, pandas DataFrames are generally going to be slower than a NumPy array.

Which is faster NumPy array or list?

As predicted, we can see that NumPy arrays are significantly faster than lists.


2 Answers

This is actually not surprising. Let's examine the methods one a time starting with the slowest.

[i for i in np.arange(10000000)]

This method asks python to reach into the numpy array (stored in the C memory scope), one element at a time, allocate a Python object in memory, and create a pointer to that object in the list. Each time you pipe between the numpy array stored in the C backend and pull it into pure python, there is an overhead cost. This method adds in that cost 10,000,000 times.

Next:

[i for i in np.arange(10000000).tolist()]

In this case, using .tolist() makes a single call to the numpy C backend and allocates all of the elements in one shot to a list. You then are using python to iterate over that list.

Finally:

list(np.arange(10000000))

This basically does the same thing as above, but it creates a list of numpy's native type objects (e.g. np.int64). Using list(np.arange(10000000)) and np.arange(10000000).tolist() should be about the same time.


So, in terms of iteration, the primary advantage of using numpy is that you don't need to iterate. Operation are applied in an vectorized fashion over the array. Iteration just slows it down. If you find yourself iterating over array elements, you should look into finding a way to restructure the algorithm you are attempting, in such a way that is uses only numpy operations (it has soooo many built-in!) or if really necessary you can use np.apply_along_axis, np.apply_over_axis, or np.vectorize.

like image 163
James Avatar answered Oct 23 '22 05:10

James


These are my timings on a slower machine

In [1034]: timeit [i for i in np.arange(10000000)]
1 loop, best of 3: 2.16 s per loop

If I generate the range directly (Py3 so this is a genertor) times are much better. Take this a baseline for a list comprehension of this size.

In [1035]: timeit [i for i in range(10000000)]
1 loop, best of 3: 1.26 s per loop

tolist converts the arange to a list first; takes a bit longer, but the iteration is still on a list

In [1036]: timeit [i for i in np.arange(10000000).tolist()]
1 loop, best of 3: 1.6 s per loop

Using list() - same time as direct iteration on the array; that suggests that the direct iteration first does this.

In [1037]: timeit [i for i in list(np.arange(10000000))]
1 loop, best of 3: 2.18 s per loop

In [1038]: timeit np.arange(10000000).tolist()
1 loop, best of 3: 927 ms per loop

same times a iterating on the .tolist

In [1039]: timeit list(np.arange(10000000))
1 loop, best of 3: 1.55 s per loop

In general if you must loop, working on a list is faster. Access to elements of a list is simpler.

Look at the elements returned by indexing.

a[0] is another numpy object; it is constructed from the values in a, but not simply a fetched value

list(a)[0] is the same type; the list is just [a[0], a[1], a[2]]]

In [1043]: a = np.arange(3)
In [1044]: type(a[0])
Out[1044]: numpy.int32
In [1045]: ll=list(a)
In [1046]: type(ll[0])
Out[1046]: numpy.int32

but tolist converts the array into a pure list, in this case, as list of ints. It does more work than list(), but does it in compiled code.

In [1047]: ll=a.tolist()
In [1048]: type(ll[0])
Out[1048]: int

In general don't use list(anarray). It rarely does anything useful, and is not as powerful as tolist().

What's the fastest way to iterate through array - None. At least not in Python; in c code there are fast ways.

a.tolist() is the fastest, vectorized way of creating a list integers from an array. It iterates, but does so in compiled code.

But what is your real goal?

like image 33
hpaulj Avatar answered Oct 23 '22 04:10

hpaulj