Scala code for simple addition using tail recursion
def add(list : List[Int],sum:Int):Int = {
//Thread.dumpStack()
if (list.isEmpty) {
sum
} else {
val headVal = list.head
add(list.tail, sum + headVal)
}
}
Below is the java code for addition in recursive mode.
public static int add(List<Integer> list, Integer sum) {
// Thread.dumpStack();
if (list.isEmpty()) {
return sum;
} else {
int headVal = list.remove(0);
return add(list, sum + headVal);
}
}
The Java version runs at least 10 times faster. Ran this for 1000 entries. Measured time by using System.nanoTime()
API before and after.
Scala version 2.10, java version Java 7. Same JVM properties for both run.
In a standard programming language, where the compiler doesn't have tail-recursive optimization, Recursive calls are usually slower than iteration. For instance, in Java, recursive calls are expensive because they can't do a tail-removal optimization.
Recursion is usually slower than iteration due to the overhead of maintaining the stack. Recursion uses more memory than iteration. Recursion makes the code smaller.
Java doesn't have tail call optimization for the same reason most imperative languages don't have it. Imperative loops are the preferred style of the language, and the programmer can replace tail recursion with imperative loops.
The simplicity of recursion comes at the cost of time and space efficiency. It is much slower than iteration due to the overhead of function calls and control shift from one function to another. It requires extra memory on the stack for each recursive call.
First of all the Scala method add
you have shown is not in a (class) context. If you have this method in a class, the tail recursion optimization cannot be applied, since the method is neither final
nor private
. If you add @tailrec
, compilation will fail. If I run it with 10000, it results in a stack overflow.
As for the Java version: The Java version is using a mutable List
: The head/tail decomposition alters the underlying list.
So after summing you can't use that list any longer, since it is empty.
Further the List
in Scala has a completely different meaning than a Java List
; the Scala list is meant for head/tail decomposition. As far as I know java.util.List has no tail method, and neither does the Java compiler apply tailrec optimizations, so the comparision is "unfair".
Anyway, I have run some JMH-based tests on different scenarios.
The only two scenarios you can really compare are "Scala while" and "Java for". They neither use OOP nor functional programming, it's just imperative.
The results of five different Scala scenarios
(Please scroll to the right, there is a small description in the last column)
Benchmark Mode Samples Mean Mean error Units
a.b.s.b.Benchmark5a.run thrpt 10 238,515 7,769 ops/ms Like in the question, but with @tailrec
a.b.s.b.Benchmark5b.run thrpt 10 130,202 2,232 ops/ms Using List.sum
a.b.s.b.Benchmark5c.run thrpt 10 2756,132 29,920 ops/ms while (no List, vars, imperative)
a.b.s.b.Benchmark5d.run thrpt 10 237,286 2,203 ops/ms tailrec version with pattern matching
a.b.s.b.Benchmark5e.run thrpt 10 214,719 2,483 ops/ms Like in the question (= no tailrec opt)
@tailrec
. List.sum
: Was very slowif
(that would have been "my style"), I am adding the source of this one:package app.benchmark.scala.benchmark5
import scala.annotation._
import org.openjdk.jmh.annotations.GenerateMicroBenchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.RunnerException
import org.openjdk.jmh.runner.options.Options
import org.openjdk.jmh.runner.options.OptionsBuilder
@State(Scope.Benchmark)
object BenchmarkState5d {
val list = List.range(1, 1000)
}
class Benchmark5d {
private def add(list : List[Int]): Int = {
@tailrec
def add(list : List[Int], sum: Int): Int = {
list match {
case Nil =>
sum
case h :: t =>
add(t, h + sum)
}
}
add(list, 0)
}
@GenerateMicroBenchmark
def run() = {
add(BenchmarkState5d.list)
}
}
The three Java scenarios
Benchmark Mode Samples Mean Mean error Units
a.b.j.b.Benchmark5a.run thrpt 10 40,437 0,532 ops/ms mutable (rebuilds the list in each iteration)
a.b.j.b.Benchmark5b.run thrpt 10 0,450 0,008 ops/ms subList
a.b.j.b.Benchmark5c.run thrpt 10 2735,951 29,177 ops/ms for
If you really want to compare in the meaning of functional programming style (= immutable, tail recursion, head/tail decomposition), than the Java version is five times slower.
As pointed out in a comment by Marko Topolnik:
subList
doesn't copy the tail, but it does something comparably bad, when applied to aLinkedList
: it wraps the original list and uses an offset to accomodate the semantics. The result is that an O(n) recursive algorithm becomes O(n2)---the same as if the tail was repeatedly copied. Plus, wrappers accrete so you end up with the list wrapped a thousand times. Definitely not comparable to a head/tail list
public class Benchmark5a {
public static int add(List<Integer> list, Integer sum) {
if (list.isEmpty()) {
return sum;
} else {
int headVal = list.remove(0);
return add(list, sum + headVal);
}
}
@GenerateMicroBenchmark
public long run() {
final List<Integer> list = new LinkedList<Integer>();
for(int i = 0; i < 1000; i++) {
list.add(i);
}
return add(list, 0);
}
public static void main(String[] args) {
System.out.println(new Benchmark5a().run());
}
}
@State(Scope.Benchmark)
class BenchmarkState5b {
public final static List<Integer> list = new LinkedList<Integer>();
static {
for(int i = 0; i < 1000; i++) {
list.add(i);
}
}
}
public class Benchmark5b {
public static int add(List<Integer> list, int sum) {
if (list.isEmpty()) {
return sum;
} else {
int headVal = list.get(0);
return add(list.subList(1, list.size()), sum + headVal);
}
}
@GenerateMicroBenchmark
public long run() {
return add(BenchmarkState5b.list, 0);
}
public static void main(String[] args) {
System.out.println(new Benchmark5b().run());
}
}
Scala results in detail
(all results only show the last scenario, and the overall results)
[...]
# VM invoker: /home/oracle-jdk-1.8-8u40/data/oracle-jdk-1.8.0_40/jre/bin/java
# VM options: <none>
# Fork: 1 of 1
# Warmup: 3 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: app.benchmark.scala.benchmark5.Benchmark5e.run
# Warmup Iteration 1: 166,153 ops/ms
# Warmup Iteration 2: 215,242 ops/ms
# Warmup Iteration 3: 216,632 ops/ms
Iteration 1: 215,526 ops/ms
Iteration 2: 213,720 ops/ms
Iteration 3: 213,967 ops/ms
Iteration 4: 215,468 ops/ms
Iteration 5: 216,247 ops/ms
Iteration 6: 217,514 ops/ms
Iteration 7: 215,503 ops/ms
Iteration 8: 211,969 ops/ms
Iteration 9: 212,989 ops/ms
Iteration 10: 214,291 ops/ms
Result : 214,719 ±(99.9%) 2,483 ops/ms
Statistics: (min, avg, max) = (211,969, 214,719, 217,514), stdev = 1,642
Confidence interval (99.9%): [212,236, 217,202]
Benchmark Mode Samples Mean Mean error Units
a.b.s.b.Benchmark5a.run thrpt 10 238,515 7,769 ops/ms
a.b.s.b.Benchmark5b.run thrpt 10 130,202 2,232 ops/ms
a.b.s.b.Benchmark5c.run thrpt 10 2756,132 29,920 ops/ms
a.b.s.b.Benchmark5d.run thrpt 10 237,286 2,203 ops/ms
a.b.s.b.Benchmark5e.run thrpt 10 214,719 2,483 ops/ms
Java results in detail
# VM invoker: /home/oracle-jdk-1.8-8u40/data/oracle-jdk-1.8.0_40/jre/bin/java
# VM options: <none>
# Fork: 1 of 1
# Warmup: 3 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: app.benchmark.java.benchmark5.Benchmark5c.run
# Warmup Iteration 1: 2777,495 ops/ms
# Warmup Iteration 2: 2888,040 ops/ms
# Warmup Iteration 3: 2692,851 ops/ms
Iteration 1: 2737,169 ops/ms
Iteration 2: 2745,368 ops/ms
Iteration 3: 2754,105 ops/ms
Iteration 4: 2706,131 ops/ms
Iteration 5: 2721,593 ops/ms
Iteration 6: 2769,261 ops/ms
Iteration 7: 2734,461 ops/ms
Iteration 8: 2741,494 ops/ms
Iteration 9: 2740,012 ops/ms
Iteration 10: 2709,915 ops/ms
Result : 2735,951 ±(99.9%) 29,177 ops/ms
Statistics: (min, avg, max) = (2706,131, 2735,951, 2769,261), stdev = 19,299
Confidence interval (99.9%): [2706,774, 2765,128]
Benchmark Mode Samples Mean Mean error Units
a.b.j.b.Benchmark5a.run thrpt 10 40,437 0,532 ops/ms
a.b.j.b.Benchmark5b.run thrpt 10 0,450 0,008 ops/ms
a.b.j.b.Benchmark5c.run thrpt 10 2735,951 29,177 ops/ms
Update: Added another Java scenario 5d using ArrayList
Benchmark Mode Samples Mean Mean error Units
a.b.j.b.Benchmark5a.run thrpt 10 34,931 0,504 ops/ms
a.b.j.b.Benchmark5b.run thrpt 10 0,430 0,005 ops/ms
a.b.j.b.Benchmark5c.run thrpt 10 2610,085 9,664 ops/ms
a.b.j.b.Benchmark5d.run thrpt 10 56,693 1,218 ops/ms
You cannot draw any meaningful conclusions about performance from such a short experiment. You might have hit a garbage collection, there might have been some other process hogging your CPU, all classes may not have been loaded and the JVM might be optimizing your code all while your test is running. Any of which would adversely effect test results.
Processing 1000 elements will be extremely fast. You should make your experiment long enough such that inaccuracy in timing measurement or external influences have a smaller effect. Try with a million elements and if that still only take a few seconds try 10 million.
Consider running the test a few times when the JVM starts up before taking the measurement to "warm up" the JVM. You want to ensure that any classes that are lazy loaded have been loaded and that any realtime optimization performed by the JVM is complete before you start your test.
Repeat the experiment a few more times using the longer list of elements. Then throw away the fastest result and the slowest result and average the rest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With