Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"growing" (appending to) a sequence object

Tags:

python

matlab

In Matlab, this type of algorithm ("growing arrays") is advised against

mine = []
for i=1:100,
    mine = [mine,randn(1)]
end

whereas it seems that many examples for Python show this kind of algorithm (this is a really bad example though):

import numpy.random as rand

mine = []
for i in range(100):
    mine.append(rand.random(1)[0])

I wonder why that is -- what is the difference?

like image 631
hatmatrix Avatar asked Oct 13 '11 16:10

hatmatrix


4 Answers

The difference is that:

  • In MATLAB, every iteration of your loop re-allocates the matrix to increase the size by one and copies the entire contents into the newly allocated space.
  • Python lists don't work like that. More space is allocated than is needed at any given point and this allocated space grows in a manner that guarantees that appends are done in amortized constant time.

That said, I think the difference is largely cultural:

  • It is common to have large numeric matrices in MATLAB, and growing such matrices one element (or one row/column) at a time would indeed be expensive.
  • On the other hand, no one would use a Python list (or a list of lists) to represent a large matrix: that would be very slow and would make very poor use of memory. Numerical Python's ndarray would be used instead, and ndarray would offer exactly the same tradeoffs as a MATLAB matrix.
like image 87
NPE Avatar answered Nov 07 '22 07:11

NPE


Appending to arrays in Matlab is apparently very inefficient (it runs in quadratic time), whereas in python the corresponding list operation is much more highly optimized. Appends are O(1) up until the list becomes full - at which point the size of the list is doubled to make more room (which is an O(n) operation). This means appends become increasingly efficient as the list grows longer (the overall efficiency is O(1) amortized). These kinds of optimizations could probably also be achieved in Matlab, but it seems they are not done automatically.

For even better performance, python also has the collections.deque container class which supports efficient appending and removal from either end (it's about O(1) in both directions).

like image 35
ekhumoro Avatar answered Nov 07 '22 07:11

ekhumoro


Your two examples are not exactly equivalent. The Matlab example is concatenating two lists into a new list, making a copy each time, whereas the Python is appending items to a list without making a copy of it every time. Now you can in fact write an exact Python equivalent of the Matlab code, e.g.:

mine = mine + [newitem]

But you shouldn't do that, because you're making a copy of an ever-growing list each time. Which is why lists have an .append() method (also .extend()).

For similar reasons, rather than building up a string by concatenation, Pythonistas recommend you append the individual strings to a list then use ``.join() on it.

By the way, Python lists are always allocated with space for extra items, so that they do not always need to be grown when a new item is appended.

like image 2
kindall Avatar answered Nov 07 '22 05:11

kindall


Your MATLAB code can be better written. Compare the following implementations:

%# preallocation
tic
x = zeros(1,100000); for i=1:100000, x(i) = 99; end
toc

%# appending
tic
x = []; for i=1:100000, x(end+1) = 99; end
toc

%# concatenation
tic
x = []; for i=1:100000, x = [x, 99]; end
toc

I get the following results:

Elapsed time is  0.001844 seconds.    %# preallocated vector/matrix
Elapsed time is  0.109597 seconds.    %# appending using "end+1" syntax
Elapsed time is 35.226361 seconds.    %# appending using concatenation

Note that the above was tested on R2011b, which introduced improvements on growing matrices (without pre-allocation).

You should also check this previous answer for a solution that combines preallocation while still allowing for dynamic growing (idea is to allocate/grow in large block sizes)

On the other side, you should note that Python lists are optimized for appending items at the end. If you insert items at the beginning, you will get very different timings. Example:

>>> from timeit import timeit

>>> timeit('x.insert(0,99)', 'x=[]', number=100000)
5.3840245059078597

>>> timeit('x.append(99)', 'x=[]', number=100000)    # x.insert(len(x),99)
0.039047700196533697
like image 2
Amro Avatar answered Nov 07 '22 06:11

Amro