Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of immutable set implementations in Scala

I have recently been diving into Scala and (perhaps predictably) have spent quite a bit of time studying the immutable collections API in the Scala standard library.

I am writing an application that necessarily does many +/- operations on large sets. For this reason, I want to ensure that the implementation I choose is a so-called "persistent" data structure so that I avoid doing copy-on-write. I saw this answer by Martin Odersky, but it didn't really clear up the issue for me.

I wrote the following test code to compare the performance of ListSet and HashSet for add operations:

import scala.collection.immutable._

object TestListSet extends App {
  var set = new ListSet[Int]
  for(i <- 0 to 100000) {
    set += i
  }
}

object TestHashSet extends App {
  var set = new HashSet[Int]
  for(i <- 0 to 100000) {
    set += i
  }
}

Here is a rough runtime measurement of the HashSet:

$ time scala TestHashSet

real    0m0.955s
user    0m1.192s
sys     0m0.147s

And ListSet:

$ time scala TestListSet

real    0m30.516s
user    0m30.612s
sys     0m0.168s

Cons on a singly linked list is a constant-time operation, but this performance looks linear or worse. Is this performance hit related to the need to check each element of the set for object equality to conform to the no-duplicates invariant of Set? If this is the case, I realize it's not related to "persistence".

As for official documentation, all I could find was the following page, but it seems incomplete: Scala 2.8 Collections API -- Performance Characteristics. Since ListSet seems initially to be a good choice for its memory footprint, maybe there should be some information about its performance in the API docs.

like image 888
Connor Doyle Avatar asked Aug 04 '11 19:08

Connor Doyle


People also ask

Is set immutable in Scala?

In Scala, both mutable and immutable sets are available. Mutable set is those set in which the value of the object is change but, in the immutable set, the value of the object is not changed itself. By default set in Scala are immutable. In Scala, the immutable set is defined under Scala.

What is immutable collection in Scala?

Immutable collections, by contrast, never change. You have still operations that simulate additions, removals, or updates, but those operations will in each case return a new collection and leave the old collection unchanged. All collection classes are found in the package scala.

Is Scala seq immutable?

Seq . The 2.13 default Seq is specifically immutable, which has more constraints on it that scala. collection.


2 Answers

An old question but also a good example of conclusions being drawn on the wrong foundation.

Connor, basically you're trying to do a microbenchmark. That is generally not recommended and damn hard to do properly.

Why? Because the JVM is doing many other things than executing the code in your examples. It's loading classes, doing garbage collection, compiling bytecode to native code, etc. All dynamically and based on different metrics sampled at runtime.

So you cannot conclude anything about the performance of the two collections with the above test code. For example, what you could actually be measuring could be the compilation time of the += method of HashSet and garbage collection times of ListSet. So it's a comparison between apples and pears.

To do a micro benchmark properly, you should:

  1. Warm up the JVM: Load all classes, ensure all code paths in the benchmark are run and hot spots in the code are compiled (e.g. the += method).
  2. Run the benchmark and ensure neither the GC or the compiler runs during the test (use the JVM flags -XX:-PrintCompilation and -XX:-PrintGC). If either runs during the test, discard the result.
  3. Repeat step 2 and sample 10-15 good measurements. Calculate variance and standard deviation.
  4. Evaluate: If the mean of each benchmark +/- 3 std do not overlap, then you can draw a conclusion about which is faster. Otherwise, it's a blurry result (depending on the amount of overlap).

I can recommend reading Oracle's recommendations for doing micro benchmarks and a great article about benchmark pitfalls by Brian Goetz.

Also, if you want to use a good tool, which does all the above for you, try Caliper by Google.

like image 138
Nicholas Avatar answered Dec 07 '22 10:12

Nicholas


Since a set has to have with no duplicates, before adding an element, a Set must check to see if it already contains the element. This search in a list that has no guarantee of an element's position will be O(N) linear time. The same general idea applies to its remove operation.

With a HashSet, the class defines a function that picks a location for any element in O(1), which makes the contains(element) method much quicker, at the expense of taking up more space to decrease the chance of element location collisions.

like image 38
Dylan Avatar answered Dec 07 '22 11:12

Dylan