Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MinMaxPriorityQueue using Java streams

I am looking for a memory-efficient way in Java to find top n elements from a huge collection. For instance, I have a word, a distance() method, and a collection of "all" words. I have implemented a class Pair that implements compareTo() so that pairs are sorted by their values.

Using streams, my naive solution looks like this:

double distance(String word1, String word2){
  ...
}

Collection<String> words = ...;
String word = "...";

words.stream()
  .map(w -> new Pair<String, Double>(w, distance(word, w)))
  .sorted()
  .limit(n);

To my understanding, this will process and intermediately store each element in words so that it can be sorted before applying limit(). However, it is more memory-efficient to have a collection that stores n elements and whenever a new element is added, it removes the smallest element (according to the comparable object's natural order) and thus never grows larger than n (or n+1).

This is exactly what the Guava MinMaxPriorityQueue does. Thus, my current best solution to the above problem is this:

Queue<Pair<String, Double>> neighbours = MinMaxPriorityQueue.maximumSize(n).create();
words.stream()
  .forEach(w -> neighbours.add(new Pair<String, Double>(w, distance(word, w)));

The sorting of the top n elements remains to be done after converting the queue to a stream or list, but this is not an issue since n is relatively small.

My question is: is there a way to do the same using streams?

like image 663
Carsten Avatar asked Mar 31 '15 07:03

Carsten


1 Answers

A heap-based structure will of course be more efficient than sorting the entire huge list. Luckily, streams library is perfectly happy to let you use specialized collections when necessary:

MinMaxPriorityQueue<Pair<String, Double>> topN = words.stream()
    .map(w -> new Pair<String, Double>(w, distance(word, w)))
    .collect(toCollection(
            () -> MinMaxPriorityQueue.maximumSize(n).create()
    ));

This is better than the .forEach solution because it's easy to parallelize and is more idiomatic java8.

Note that () -> MinMaxPriorityQueue.maximumSize(n).create() should be possible to be replaced with MinMaxPriorityQueue.maximumSize(n)::create but, for some reason, that won't compile under some conditions (see comments below).

like image 108
Misha Avatar answered Nov 16 '22 13:11

Misha