Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallelize Scala's Iterator

Please note: This is not a duplicate question, since this question specifies on all methods Iterator has, not just map and flatMap. Therefore Future.traverse are not a good answer.

Let's say I have this simple statement:

(1 to 100).toSet.subsets.find(f)

It works perfectly. It's lazy, doesn't use a lot of memory and returns as soon as one element is found. The problem starts when you want to parallelize it. You might say, it's Scala, there has to be .par or Iterator, but there isn't.

The proposed solution on the internet is to use .grouped, but it's not as good as I'd want. Why?

val it = (1 to 100).toSet.subsets.grouped(1000000).map(_.par.find(f)).flatten
if (it.hasNext) Some(it.next) else None
  1. Uses much more memory. I know it's still O(1), but let's be perfect here :)

  2. It's not perfectly parallelizable (by Amdahl's law). When .grouped is consuming the iterator for next block of million elements, all but one thread waits. This is especially problematic if iterator is expensive to consume. Moreover, there has to be an overhead of spawning a new set of threads to work on a new block.

  3. Produces more complicated / longer code (see example). It would shorten code a bit if Iterator had .nextOption, but still.

Is there anything else despite programming my own producer-consumer model (iterator is producer, threads are consumers) and then final reduce step?

like image 249
Rok Kralj Avatar asked May 20 '14 09:05

Rok Kralj


1 Answers

You can use .toStream. This will produce a lazy stream that will memoize values. And it has .par on it.

It will allocate some wrappers on the heap but if you're careful(not hold around pointers to the stream) this will only result in GC pressure but not in increasing residual memory footprint. It will still go quite fast. Please mind that parallel collections induce quite a lot of overhead and might not be worth it if your per-element computation is not expensive enough.

Iterators are just too low-level to parallelize. But you don't actually need a parallel iterator but rather a parallel traversal of the iterator you can have Future.traverse from the standard library.

like image 187
edofic Avatar answered Sep 17 '22 15:09

edofic