On this page it says that methods push!()
and append!()
are very efficient.
My question is how exactly efficient they are?
Namely,
If one knows the size of the final array, is it still faster to preallocate the array or growing it incrementally using append!()
/ push!()
would be just as efficient?
Now consider the case when one does not know the size of the final array. For example, merging multiple arrays into 1 big array (call it A
).
Two ways to achieve this:
append!()
-ing each array to A
, whose size has not been preallocated.A
. Then preallocate A
and copy over contents of each array.Which one would be more efficient in this case?
"Appending" adds to the end of a list, "pushing" adds to the front.
Julia allows adding new elements in an array with the use of push! command. Elements in an array can also be added at a specific index by passing the range of index values in the splice! function.
To initialize an empty array, it is perfectly valid to use n = 0 in the above expressions. A possible source of errors would be to confuse this array with an empty array of “Any” type, which is initialized as follows: v = [] # Same as Any[], and you can't change this type easily later (gotcha!)
The answer to a question like this is usually: "it depends". For example, what size array are you trying to make? What is the element-type of the array?
But if you're just after a heuristic, why not run a simple speed test? For example, the following snippet:
function f1(N::Int)
x = Array(Int, N)
for n = 1:N
x[n] = n
end
return(x)
end
function f2(N::Int)
x = Array(Int, 0)
for n = 1:N
push!(x, n)
end
return(x)
end
f1(2)
f2(2)
N = 5000000000
@time f1(N)
@time f2(N)
suggests that using push!
is about 6 times slower than pre-allocating. If you were using append!
to add larger blocks with less steps, the multiplier would almost certainly be less.
When interpreting these numbers, resist the knee-jerk reaction of "What!? 6-times slower!?". This number needs to be placed in the context of how important array building is to your entire program/function/subroutine. For example, if array building comprises only 1% of the run-time of your routine (for most typical routines, array building would comprise much less than 1%), then if your routine runs for 100 seconds, 1 second is spent building arrays. Multiply that by 6 to get 6 seconds. 99 seconds + 6 seconds = 105 seconds. So, using push!
instead of pre-allocating increases the runtime of your whole program by 5%. Unless you work in high-frequency trading, you're probably not going to care about that.
For myself, my usual rule is this: if I can pre-allocate easily, then I pre-allocate. But if push!
makes the routine much easier to code, with lower possibility of introducing bugs, and less messing around trying to pre-determine the appropriate array size, then I use push!
without a second thought.
Final note: if you want to actually look at the specifics of how push!
works, you'll need to delve into the C routines, since the julia source just wraps a ccall
.
UPDATE: OP questioned in the comments the difference between push!
and an operation like array(end+1) = n
in MATLAB. I haven't coded in MATLAB recently, but I do keep a copy on my machine since the code for all my older papers is in MATLAB. My current version is R2014a. My understanding is that in this version of MATLAB, adding to the end of the array will re-allocate the entire array. In contrast, push!
in Julia works, to the best of my knowledge, much like lists in .NET
. The memory allocated to the vector is dynamically added in blocks as the size of the vector grows. This massively reduces the amount of re-allocation that needs to be performed, although my understanding is that some re-allocation is still necessary (I'm happy to be corrected on this point). So push!
should work much faster than adding to an array in Matlab. So we can run the following MATLAB code:
N = 10000000;
tic
x = ones(N, 1);
for n = 1:N
x(n) = n;
end
toc
N = 10000000;
tic
x = [];
for n = 1:N
x(end+1) = n;
end
toc
I get:
Elapsed time is 0.407288 seconds.
Elapsed time is 1.802845 seconds.
So, about a 5-times slowdown. Given the extreme non-rigour applied in the timing methodology, one might be tempted to say this is equivalent to the Julia case. But wait, if we re-run the exercise in Julia with N = 10000000
, the timings are 0.01 and 0.07 seconds. The sheer difference in the magnitude of these numbers to the MATLAB numbers makes me very nervous about making claims about what is actually happening under the hood, and whether it is legitimate to compare the 5-times slowdown in MATLAB to the 6-times slowdown in Julia. Basically, I'm now out of my depth. Maybe someone who knows more about what MATLAB actually does under the hood can offer more insight. Regarding Julia, I'm not much of a C-coder, so I doubt I'll get much insight from looking through the source (which is publicly available, unlike MATLAB).
push!
will always be slower than inserting into a pre-allocated array, if for no other reason than push!
(1) inserts the element, just as when you do so manually, and (2) increments the length of the array. Two operations cannot be faster than one, when one is part of two.
However, as noted in other answers the gap is often not so large as to something to be concerned about. Internally (last time I checked the code), Julia uses a grow-by-a-factor-of-2 strategy, so that you only need log2(N)
reallocations.
If you do know the size of the array in advance, you can eliminate reallocations by using sizehint!
. As you can easily test for yourself, this does not eliminate the performance penalty relative to insertion into a preallocated array, but it can reduce it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With