Scala 2.9 introduced parallel collections. They are a really great tool for certain tasks. But how do they work internally, and can I influence their behavior and configuration?
How do they determine the optimal number of threads? And if I am not satisfied with the result, are there any configuration parameters to adjust?
I'm not only interested in how many threads are actually created; I also want to know how the actual work is distributed amongst them, how the results are collected, and how much magic is going on behind the scenes. Does Scala somehow test whether a collection is large enough to benefit from parallel processing?
Briefly, there are two orthogonal aspects to how your operations are parallelized:
1. The extent to which the collection is split into chunks of work for a parallelizable operation (such as map or filter).
2. The number of threads to use for the underlying fork-join pool (on which the parallel tasks are run).
For #2, this is managed by the pool itself, which discovers the "ideal" level of parallelism at runtime (see java.lang.Runtime.getRuntime.availableProcessors).
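To make the thread-count side concrete, here is a minimal sketch, assuming Scala 2.12 (or 2.13 plus the scala-parallel-collections module and its CollectionConverters import for .par); the object name ParallelismConfig is just for illustration. It prints the core count the default pool sizes itself from, and uses the per-collection tasksupport setter that later versions (2.10+) added; in 2.9 itself the knob was the global default fork-join pool rather than this setter.

    import java.util.concurrent.ForkJoinPool
    import scala.collection.parallel.ForkJoinTaskSupport

    object ParallelismConfig extends App {
      // The default fork-join pool sizes itself from the number of available cores.
      println("availableProcessors = " + Runtime.getRuntime.availableProcessors)

      val xs = (1 to 1000000).toVector.par

      // Override the default by giving this particular collection its own pool
      // with an explicit parallelism level of 4.
      xs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

      println(xs.map(_.toLong * 2).sum)
    }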
For #1, this is a separate problem, and the Scala parallel collections API handles it via the concept of work-stealing (adaptive scheduling). That is, when a worker finishes a particular piece of work, it attempts to steal work from other workers' queues. If none is available, this is an indication that all of the processors are very busy, and hence that a bigger chunk of work should be taken.
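The following is a deliberately simplified, hypothetical sketch of the plain work-stealing idea, not the library's actual scheduler and without its adaptive chunk-sizing: each worker owns a deque of index ranges, works from one end of its own deque, and steals from the other end of some deque when its own runs dry. The names WorkStealingSketch, Chunk and parSum are invented for illustration, and it assumes Scala 2.12+ for the Runnable lambda.

    import java.util.concurrent.ConcurrentLinkedDeque
    import java.util.concurrent.atomic.AtomicLong

    object WorkStealingSketch {
      final case class Chunk(from: Int, until: Int)

      def parSum(data: Array[Int], numWorkers: Int): Long = {
        // Pre-split the input into more chunks than workers so there is something to steal.
        val chunkSize = math.max(1, data.length / (numWorkers * 4))
        val deques = Array.fill(numWorkers)(new ConcurrentLinkedDeque[Chunk]())
        var i = 0
        var owner = 0
        while (i < data.length) {
          deques(owner % numWorkers).addLast(Chunk(i, math.min(i + chunkSize, data.length)))
          i += chunkSize
          owner += 1
        }

        val total = new AtomicLong(0)
        val workers = (0 until numWorkers).map { id =>
          new Thread(() => {
            var done = false
            while (!done) {
              // Take from the tail of our own deque; if it is empty, try to
              // steal from the head of another deque (scanning all of them).
              val next = Option(deques(id).pollLast())
                .orElse(deques.indices.iterator.map(v => deques(v).pollFirst()).find(_ != null))
              next match {
                case Some(c) =>
                  var s = 0L
                  var j = c.from
                  while (j < c.until) { s += data(j); j += 1 }
                  total.addAndGet(s)
                case None =>
                  done = true // nothing left anywhere: all chunks have been processed
              }
            }
          })
        }
        workers.foreach(_.start())
        workers.foreach(_.join())
        total.get()
      }

      def main(args: Array[String]): Unit = {
        val data = Array.tabulate(1000000)(identity)
        println(parSum(data, Runtime.getRuntime.availableProcessors))
      }
    }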
Aleksandar Prokopec, who implemented the library, gave a talk at this year's ScalaDays which will be online shortly. He also gave a great talk at ScalaDays 2010 where he describes in detail how the operations are split and re-joined (there are a number of issues that are not immediately obvious and some lovely bits of cleverness in there too!).
A more comprehensive answer is available in the PDF describing the parallel collections API.