Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a Scala parallel collection from a Java collection

The easiest way to convert a Java Collection to a Scala equivalent is using JavaConversions, since Scala 2.8.. These implicit defs return wrappers for the contained Java Collection.

Scala 2.9 introduced parallel collections, where operations on a collection can be executed in parallel and the result collected later. This is easily implemented, converting an existing collection into a parallel one is as simple as:

myCollection.par

But there's a problem with using 'par' on collections converted from Java collections using JavaConversions. As described in Parallel Collection Conversions, inherently sequential collections are 'forced' into a new parallel collection by evaluating all of the values and adding them to the new parallel collection:

Other collections, such as lists, queues or streams, are inherently sequential in the sense that the elements must be accessed one after the other. These collections are converted to their parallel variants by copying the elements into a similar parallel collection. For example, a functional list is converted into a standard immutable parallel sequence, which is a parallel vector.

This causes problems when the original Java collection is intended to be lazily evaluated. For instance, if only a Java Iterable is returned, later converted to a Scala Iterable, there's no guarantee that the contents of the Iterable are intended to be accessed eagerly or not. So how should a parallel collection be created from a Java collection without sustaining the cost of evaluating each element? It is this cost I am trying to avoid by using a parallel collection to execute them in parallel and hopefully 'take' the first n results that are offered.

According to Parallel Collection Conversions there are a series of collection types that cost constant time, but there doesn't appear to be a way of getting a guarantee that these types can be created by JavaConversions (e.g. 'Set' can be created, but is that a 'HashSet'?).

like image 985
Dan Gravell Avatar asked Oct 12 '12 12:10

Dan Gravell


Video Answer


1 Answers

First, every collection obtained via JavaConversions from a Java collection is not a by-default parallelizable Scala collection - this means that it will always be reevaluated into its corresponding parallel collection implementation. The reason for this is that parallel execution relies on the concepts of Splitters at least - it has to be splittable into smaller subsets that different processors can then work on.

I don't know how your Java collection looks in the data-structure sense, but if it's a tree-like thing or an array underneath whose elements are evaluated lazily, chances are that you can easily implement a Splitter.

If you do not want to eagerly force a lazy collection that implements a Java collection API, then your only option is to implement a new type of a parallel collection for that particular lazy Java collection. In this new implementation you have to provide means of splitting the iterator (that is, a Splitter).

Once you implement this new parallel collection which knows how to split your data-structure, you should create a custom Scala wrapper for your specific Java collection (at this point it's just a little bit of extra boilerplate, see how it's done in JavaConversions) and override its par to return your specific parallel collection.

You might even be able to do this generically for indexed sequences. Given that your Java collection is a sequence (in Java, a List) with a particularly efficient get method, you could implement the Splitter as an iterator that calls get within the initial range from 0 to size - 1, and is split by subdividing this range.

If you do, patches to the Standard library are always welcome.

like image 132
axel22 Avatar answered Oct 23 '22 10:10

axel22