Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala: what is the most appropriate data structure for sorted subsets?

Given a large collection (let's call it 'a') of elements of type T (say, a Vector or List) and an evaluation function 'f' (say, (T) => Double) I would like to derive from 'a' a result collection 'b' that contains the N elements of 'a' that result in the highest value under f. The collection 'a' may contain duplicates. It is not sorted.

Maybe leaving the question of parallelizability (map/reduce etc.) aside for a moment, what would be the appropriate Scala data structure for compiling the result collection 'b'? Thanks for any pointers / ideas.

Notes:

(1) I guess my use case can be most concisely expressed as

val a = Vector( 9,2,6,1,7,5,2,6,9 ) // just an example
val f : (Int)=>Double = (n)=>n      // evaluation function
val b = a.sortBy( f ).take( N )     // sort, then clip

except that I do not want to sort the entire set.

(2) one option might be an iteration over 'a' that fills a TreeSet with 'manual' size bounding (reject anything worse than the worst item in the set, don't let the set grow beyond N). However, I would like to retain duplicates present in the original set in the result set, and so this may not work.

(3) if a sorted multi-set is the right data structure, is there a Scala implementation of this? Or a binary-sorted Vector or Array, if the result set is reasonably small?

like image 444
Gregor Scheidt Avatar asked Oct 17 '11 10:10

Gregor Scheidt


1 Answers

You can use a priority queue:

def firstK[A](xs: Seq[A], k: Int)(implicit ord: Ordering[A]) = {
  val q = new scala.collection.mutable.PriorityQueue[A]()(ord.reverse)
  val (before, after) = xs.splitAt(k)
  q ++= before
  after.foreach(x => q += ord.max(x, q.dequeue))
  q.dequeueAll
}

We fill the queue with the first k elements and then compare each additional element to the head of the queue, swapping as necessary. This works as expected and retains duplicates:

scala> firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)
res14: scala.collection.mutable.Buffer[Int] = ArrayBuffer(6, 7, 9, 9)

And it doesn't sort the complete list. I've got an Ordering in this implementation, but adapting it to use an evaluation function would be pretty trivial.

like image 130
Travis Brown Avatar answered Sep 19 '22 13:09

Travis Brown