In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first: Array[Float] ...
var second: Array[Float] ...
var sum = 0f
for (ix <- 0 until first.length)
  sum += first(ix) * second(ix)
Here is how I rewrote it in a functional style:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method took 40 times as long to complete!
Why does the second method take so much longer? How can I rewrite the work so that it is both fast and in a functional style?
There are several reasons why these two examples are so different in speed. Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference to each of them is stored into the Tuple2.
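To make that intermediate allocation concrete, here is a rough sketch of what the zip step ends up materializing, reusing the first and second arrays from the question. This is only an illustration of the effect, not the actual library implementation:

// Illustration only: roughly what first.zip(second) has to build.
val zipped = new Array[(Float, Float)](first.length min second.length)
var i = 0
while (i < zipped.length) {
  // each Float is boxed to a java.lang.Float before being stored in the Tuple2
  zipped(i) = (first(i), second(i))
  i += 1
}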
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still does boxing/unboxing and java.lang.Float creation at each iteration, since reduceLeft receives a Function2, which is parameterized.
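To see where that remaining boxing comes from, note that the _+_ literal compiles to a generic Function2; without specialization its erased apply works on objects, so (as a sketch of the effect, not exact bytecode) every call made by reduceLeft boxes its Float arguments and result:

// What _+_ roughly desugars to: a Function2[Float, Float, Float] whose erased
// apply is apply(Object, Object): Object, hence one box/unbox per element.
val add: Function2[Float, Float, Float] = (a: Float, b: Float) => a + b
val total = first.zip(second).map { case (a, b) => a * b }.reduceLeft(add)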
Scala 2.8 introduces a feature called specialization, which will get rid of a lot of this boxing/unboxing. But let's consider alternative, faster versions. We could, for instance, do the map and the reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't save much, actually, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives a much better result than the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combine zipped with foldLeft, because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double actually takes longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
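As a rough illustration of what specialization buys, here is a hypothetical user-level sketch; it is not how the standard library's Function2 is actually declared, and the exact annotation syntax varied across early 2.8 builds:

// Hypothetical specialized trait: the compiler emits extra Float- and
// Double-specific variants of apply, so a Combine[Float] can be invoked
// without creating java.lang.Float wrappers.
trait Combine[@specialized(Float, Double) A] {
  def apply(x: A, y: A): A
}

val mulFloats = new Combine[Float] {
  def apply(x: Float, y: Float): Float = x * y
}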
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements and the way Java (the JVM) handles primitives and generics. Similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you wrote it, but the functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested Boxed variants which work on java.lang.Doubles instead of primitive Doubles, to measure the effect of primitive-type boxing and unboxing. Here is what I got (running the Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test):
loopArray           461    437    436    437    435
reduceArray        6573   6544   6718   6828   6554
loopVector         5877   5773   5775   5791   5657
reduceVector       5064   4880   4844   4828   4926
loopArrayBoxed     2627   2551   2569   2537   2546
reduceArrayBoxed   4809   4434   4496   4434   4365
loopVectorBoxed    7577   7450   7456   7463   7432
reduceVectorBoxed  5116   4903   5006   4957   5122
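For concreteness, here is a minimal sketch of what the loopArray and reduceArray variants could look like; this is an assumed reconstruction for illustration, not the exact benchmark code that produced the numbers above:

// Assumed illustration of the two primitive-array variants, on Double arrays.
val xs = Array.fill(1000000)(math.random)
val ys = Array.fill(1000000)(math.random)

def loopArray: Double = {            // imperative: single fused pass, no allocation
  var sum = 0.0
  var i = 0
  while (i < xs.length) { sum += xs(i) * ys(i); i += 1 }
  sum
}

def reduceArray: Double =            // functional: zipped, mapped, then reduced
  (xs, ys).zipped map (_ * _) reduceLeft (_ + _)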
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.