
Why do Scala parallel collections sometimes cause an OutOfMemoryError?

This takes around 1 second

(1 to 1000000).map(_+3)

While this gives java.lang.OutOfMemoryError: Java heap space

(1 to 1000000).par.map(_+3)

EDIT:

I have the standard Scala 2.9.2 configuration. I am typing this at the Scala REPL prompt. In the bash launcher script I can see [ -n "$JAVA_OPTS" ] || JAVA_OPTS="-Xmx256M -Xms32M", and I don't have JAVA_OPTS set in my environment.

1 million integers = 8 MB; creating the list twice = 16 MB.
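If the default 256 MB heap shown above is indeed the limit being hit, one common workaround (a sketch, not taken from the answers below) is to launch the REPL with a larger heap via the same JAVA_OPTS variable the launcher script checks:

```shell
# Launch the Scala REPL with a larger heap for this session only.
# 1g is an illustrative value, not a recommendation from the answers.
JAVA_OPTS="-Xmx1g -Xms32M" scala
```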

FUD asked Jun 01 '12


2 Answers

It seems definitely related to the JVM memory options and to the memory required to store a parallel collection. For example:

scala> (1 to 1000000).par.map(_+3)

ends up with an OutOfMemoryError the third time I tried to evaluate it, while

scala> (1 to 1000000).par.map(_+3).seq

never failed. The issue is not the computation, it's the storage of the parallel collection.
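The difference above can be sketched as follows: appending .seq converts the parallel result back to a sequential collection, so the REPL line retains only a plain Vector[Int] rather than the parallel collection's structures (a sketch; result is an illustrative name):

```scala
// Sketch: convert the parallel result back to a sequential collection,
// so the REPL retains a plain Vector[Int] instead of the parallel
// collection's intermediate structures.
val result = (1 to 1000000).par.map(_ + 3).seq
```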

Nicolas answered Nov 02 '22


Several reasons for the failure:

  1. Parallel collections are not specialized, so the objects get boxed. This means that you can't just multiply the number of elements by 8 to get the memory usage.
  2. Using map means that the range is converted into a vector. An efficient concatenation for parallel vectors has not been implemented yet, so merging the intermediate vectors produced by different processors proceeds by copying, which requires more memory. This will be addressed in future releases.
  3. The REPL stores previous results, so the object evaluated on each line remains in memory.
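Points 1 and 3 above suggest a workaround when only a reduced value is needed (a sketch; sum is an illustrative name): aggregating to a single Long avoids materializing an intermediate parallel vector, and the REPL then retains one small result per line instead of a million boxed elements.

```scala
// Sketch: reduce on the parallel range instead of mapping it.
// seqop folds each chunk: (acc, elem) => acc + (elem + 3);
// combop merges the per-processor partial sums.
val sum = (1 to 1000000).par.aggregate(0L)(_ + (_ + 3), _ + _)
```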
axel22 answered Nov 02 '22