
Why is Scala's tail recursion slower than that of Java?

Scala code for simple addition using tail recursion

def add(list: List[Int], sum: Int): Int = {
    // Thread.dumpStack()
    if (list.isEmpty) {
        sum
    } else {
        val headVal = list.head
        add(list.tail, sum + headVal)
    }
}

Below is the Java code for the same addition, written recursively.

public static int add(List<Integer> list, Integer sum) {
    // Thread.dumpStack();
    if (list.isEmpty()) {
        return sum;
    } else {
        int headVal = list.remove(0);
        return add(list, sum + headVal);
    }
}

The Java version runs at least 10 times faster. I ran this for 1000 entries and measured the time with System.nanoTime() before and after the call.

Scala version 2.10, Java version 7. Same JVM properties for both runs.

asked Mar 07 '15 by RockSolid




2 Answers

First of all, the Scala method add as you have shown it is not in a (class) context. If you put this method in a class, the tail-recursion optimization cannot be applied, because the method is neither final nor private; adding @tailrec then makes compilation fail. If I run that version with 10000 entries, it results in a stack overflow.
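As a sketch of that restriction (object and method names here are my own): on an object, or on a final or private method, the compiler can prove the call cannot be overridden, so @tailrec is accepted and the recursion is compiled into a loop. The same annotation on an overridable class method is a compile error.

```scala
import scala.annotation.tailrec

object TailrecDemo {
  // On an object the method cannot be overridden, so @tailrec is
  // accepted and the recursion is rewritten into a loop. On an open
  // class method, the same annotation would not compile.
  @tailrec
  def add(list: List[Int], sum: Long): Long =
    if (list.isEmpty) sum
    else add(list.tail, sum + list.head)

  def main(args: Array[String]): Unit = {
    // 100000 stack frames would overflow the default stack
    // without the optimization
    println(add(List.range(1, 100001), 0L))
  }
}
```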

As for the Java version: it uses a mutable List. The head/tail decomposition alters the underlying list, so after summing you can't use that list any longer, since it is empty.

Furthermore, List in Scala has a completely different meaning than a Java List; the Scala List is designed for head/tail decomposition. As far as I know, java.util.List has no tail method, and the Java compiler does not apply tail-recursion optimization either, so the comparison is unfair.
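To make the difference concrete, here is a small sketch (my own example): Scala's List shares structure, so taking the tail is O(1) and leaves the original list intact, whereas the mutable Java version above empties its input.

```scala
object HeadTailDemo {
  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3)
    val rest = xs.tail       // O(1): returns the existing second cons cell
    println(xs)              // List(1, 2, 3) -- the original is untouched
    println(rest)            // List(2, 3)
    println(rest eq xs.tail) // true: the tail is shared, never copied
  }
}
```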

Anyway, I have run some JMH-based tests on different scenarios.

The only two scenarios you can really compare are "Scala while" and "Java for": they use neither OOP nor functional programming, just imperative loops.

The results of five different Scala scenarios

(Please scroll to the right, there is a small description in the last column)

Benchmark                   Mode   Samples         Mean   Mean error    Units
a.b.s.b.Benchmark5a.run    thrpt        10      238,515        7,769   ops/ms Like in the question, but with @tailrec
a.b.s.b.Benchmark5b.run    thrpt        10      130,202        2,232   ops/ms Using List.sum
a.b.s.b.Benchmark5c.run    thrpt        10     2756,132       29,920   ops/ms while (no List, vars, imperative)
a.b.s.b.Benchmark5d.run    thrpt        10      237,286        2,203   ops/ms tailrec version with pattern matching
a.b.s.b.Benchmark5e.run    thrpt        10      214,719        2,483   ops/ms Like in the question (= no tailrec opt)
  • 5a and 5e are like in the question; 5a with @tailrec.
  • 5b: List.sum was very slow.
  • 5c: Not a big surprise, the imperative version is the fastest one.
  • 5d uses pattern matching instead of if (that would have been "my style"); I am adding its source:
package app.benchmark.scala.benchmark5

import scala.annotation._
import org.openjdk.jmh.annotations.GenerateMicroBenchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.RunnerException
import org.openjdk.jmh.runner.options.Options
import org.openjdk.jmh.runner.options.OptionsBuilder

@State(Scope.Benchmark)
object BenchmarkState5d {
  val list = List.range(1, 1000)
}

class Benchmark5d {
  private def add(list : List[Int]): Int = {
    @tailrec
    def add(list : List[Int], sum: Int): Int = {
      list match {
        case Nil =>
          sum
        case h :: t =>
          add(t, h + sum)
      }
    }
    add(list, 0)
  }

  @GenerateMicroBenchmark
  def run() = {
    add(BenchmarkState5d.list)
  }
}

The three Java scenarios

Benchmark                   Mode   Samples         Mean   Mean error    Units
a.b.j.b.Benchmark5a.run    thrpt        10       40,437        0,532   ops/ms mutable (rebuilds the list in each iteration)
a.b.j.b.Benchmark5b.run    thrpt        10        0,450        0,008   ops/ms subList
a.b.j.b.Benchmark5c.run    thrpt        10     2735,951       29,177   ops/ms for

If you really want to compare in the sense of functional programming style (immutable data, tail recursion, head/tail decomposition), then the Java version is about five times slower.

As pointed out in a comment by Marko Topolnik:

subList doesn't copy the tail, but it does something comparably bad when applied to a LinkedList: it wraps the original list and uses an offset to accommodate the semantics. The result is that an O(n) recursive algorithm becomes O(n²), the same as if the tail was repeatedly copied. Plus, wrappers accrete, so you end up with the list wrapped a thousand times. Definitely not comparable to a head/tail list.

public class Benchmark5a {
    public static int add(List<Integer> list, Integer sum) {
        if (list.isEmpty()) {
            return sum;
        } else {
            int headVal = list.remove(0);
            return add(list, sum + headVal);
        }
    }

    @GenerateMicroBenchmark
    public long run() {
        final List<Integer> list = new LinkedList<Integer>();
        for(int i = 0; i < 1000; i++) {
            list.add(i);
        }
        return add(list, 0);
    }

    public static void main(String[] args) {
        System.out.println(new Benchmark5a().run());
    }
}

@State(Scope.Benchmark)
class BenchmarkState5b {
    public final static List<Integer> list = new LinkedList<Integer>();

    static {
        for(int i = 0; i < 1000; i++) {
            list.add(i);
        }
    }
}

public class Benchmark5b {
    public static int add(List<Integer> list, int sum) {
        if (list.isEmpty()) {
            return sum;
        } else {
            int headVal = list.get(0);
            return add(list.subList(1, list.size()), sum + headVal);
        }
    }

    @GenerateMicroBenchmark
    public long run() {
        return add(BenchmarkState5b.list, 0);
    }

    public static void main(String[] args) {
        System.out.println(new Benchmark5b().run());
    }
}
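To see why the subList variant is so slow, the wrapping behavior can be sketched like this (my own example, written in Scala against java.util's LinkedList): each subList call wraps the previous view with an offset instead of producing a new head node, so every get(0) still has to be translated into the original list.

```scala
import java.util.{LinkedList, List => JList}

object SubListDemo {
  // Same recursion shape as Benchmark5b, with subList as the "tail".
  // Each subList wraps the previous view with an offset, so get(0)
  // must be translated through all the wrappers into the original
  // LinkedList; the whole traversal degrades towards O(n^2), unlike
  // Scala's O(1) List.tail.
  def add(list: JList[Integer], sum: Int): Int =
    if (list.isEmpty) sum
    else add(list.subList(1, list.size), sum + list.get(0))

  def main(args: Array[String]): Unit = {
    val list = new LinkedList[Integer]()
    for (i <- 1 to 5) list.add(i)
    println(add(list, 0)) // 15 -- and the original list is left untouched
  }
}
```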

Scala results in detail

(all results only show the last scenario, and the overall results)

[...]

# VM invoker: /home/oracle-jdk-1.8-8u40/data/oracle-jdk-1.8.0_40/jre/bin/java
# VM options: <none>
# Fork: 1 of 1
# Warmup: 3 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: app.benchmark.scala.benchmark5.Benchmark5e.run
# Warmup Iteration   1: 166,153 ops/ms
# Warmup Iteration   2: 215,242 ops/ms
# Warmup Iteration   3: 216,632 ops/ms
Iteration   1: 215,526 ops/ms
Iteration   2: 213,720 ops/ms
Iteration   3: 213,967 ops/ms
Iteration   4: 215,468 ops/ms
Iteration   5: 216,247 ops/ms
Iteration   6: 217,514 ops/ms
Iteration   7: 215,503 ops/ms
Iteration   8: 211,969 ops/ms
Iteration   9: 212,989 ops/ms
Iteration  10: 214,291 ops/ms

Result : 214,719 ±(99.9%) 2,483 ops/ms
  Statistics: (min, avg, max) = (211,969, 214,719, 217,514), stdev = 1,642
  Confidence interval (99.9%): [212,236, 217,202]


Benchmark                   Mode   Samples         Mean   Mean error    Units
a.b.s.b.Benchmark5a.run    thrpt        10      238,515        7,769   ops/ms
a.b.s.b.Benchmark5b.run    thrpt        10      130,202        2,232   ops/ms
a.b.s.b.Benchmark5c.run    thrpt        10     2756,132       29,920   ops/ms
a.b.s.b.Benchmark5d.run    thrpt        10      237,286        2,203   ops/ms
a.b.s.b.Benchmark5e.run    thrpt        10      214,719        2,483   ops/ms

Java results in detail

# VM invoker: /home/oracle-jdk-1.8-8u40/data/oracle-jdk-1.8.0_40/jre/bin/java
# VM options: <none>
# Fork: 1 of 1
# Warmup: 3 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: app.benchmark.java.benchmark5.Benchmark5c.run
# Warmup Iteration   1: 2777,495 ops/ms
# Warmup Iteration   2: 2888,040 ops/ms
# Warmup Iteration   3: 2692,851 ops/ms
Iteration   1: 2737,169 ops/ms
Iteration   2: 2745,368 ops/ms
Iteration   3: 2754,105 ops/ms
Iteration   4: 2706,131 ops/ms
Iteration   5: 2721,593 ops/ms
Iteration   6: 2769,261 ops/ms
Iteration   7: 2734,461 ops/ms
Iteration   8: 2741,494 ops/ms
Iteration   9: 2740,012 ops/ms
Iteration  10: 2709,915 ops/ms

Result : 2735,951 ±(99.9%) 29,177 ops/ms
  Statistics: (min, avg, max) = (2706,131, 2735,951, 2769,261), stdev = 19,299
  Confidence interval (99.9%): [2706,774, 2765,128]


Benchmark                   Mode   Samples         Mean   Mean error    Units
a.b.j.b.Benchmark5a.run    thrpt        10       40,437        0,532   ops/ms
a.b.j.b.Benchmark5b.run    thrpt        10        0,450        0,008   ops/ms
a.b.j.b.Benchmark5c.run    thrpt        10     2735,951       29,177   ops/ms

Update: Added another Java scenario 5d using ArrayList

Benchmark                   Mode   Samples         Mean   Mean error    Units
a.b.j.b.Benchmark5a.run    thrpt        10       34,931        0,504   ops/ms
a.b.j.b.Benchmark5b.run    thrpt        10        0,430        0,005   ops/ms
a.b.j.b.Benchmark5c.run    thrpt        10     2610,085        9,664   ops/ms
a.b.j.b.Benchmark5d.run    thrpt        10       56,693        1,218   ops/ms
answered Sep 19 '22 by Beryllium


You cannot draw any meaningful conclusions about performance from such a short experiment. You might have hit a garbage collection, some other process might have been hogging your CPU, not all classes may have been loaded yet, and the JVM might still be optimizing your code while your test is running. Any of these would adversely affect the results.

Processing 1000 elements is extremely fast. Make your experiment long enough that inaccuracy in the timing measurement and external influences have a smaller relative effect. Try a million elements, and if that still takes only a few seconds, try 10 million.

Consider running the test a few times after the JVM starts, before taking measurements, to "warm up" the JVM. You want to ensure that any lazily loaded classes have been loaded and that any runtime optimization performed by the JVM is complete before you start measuring.

Repeat the experiment several times with the longer list. Then throw away the fastest and slowest results and average the rest.
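A minimal sketch of that methodology (NaiveBench and measure are hypothetical names; for real measurements prefer a harness like JMH, which handles warm-up and statistics properly): warm up, take several samples, drop the extremes, and average the rest.

```scala
object NaiveBench {
  // Naive micro-harness illustrating the advice above.
  def measure(warmups: Int, samples: Int)(body: => Unit): Double = {
    for (_ <- 1 to warmups) body                 // let the JIT compile the hot path
    val times = for (_ <- 1 to samples) yield {
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }
    val kept = times.sorted.drop(1).dropRight(1) // discard fastest and slowest
    kept.sum.toDouble / kept.size / 1e6          // mean in milliseconds
  }

  def main(args: Array[String]): Unit = {
    val list = List.range(1, 1000001)            // a million elements, not 1000
    val ms = measure(warmups = 5, samples = 12)(list.sum)
    println(f"mean of trimmed samples: $ms%.3f ms")
  }
}
```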

answered Sep 18 '22 by bhspencer