Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient groupwise aggregation on Scala collections

I often need to do something like

coll.groupBy(f(_)).mapValues(_.foldLeft(x)(g(_,_)))

What is the best way to achieve the same effect, but avoid explicitly constructing the intermediate collections with groupBy?

like image 508
Daniel Mahler Avatar asked Mar 20 '13 16:03

Daniel Mahler


2 Answers

You could fold the initial collection over a map holding your intermediate results:

def groupFold[A,B,X](as: Iterable[A], f: A => B, init: X, g: (X,A) => X): Map[B,X] = 
  as.foldLeft(Map[B,X]().withDefaultValue(init)){
    case (m,a) => {
      val key = f(a)
      m.updated(key, g(m(key),a))
    }
  }

You said collection and I wrote Iterable, but you have to think whether order matters in the fold in your question.

If you want efficient code, you will probably use a mutable map as in Rex' answer.

like image 90
ziggystar Avatar answered Nov 15 '22 04:11

ziggystar


You can't really do it as a one-liner, so you should be sure you need it before writing something more elaborate like this (written from a somewhat performance-minded view since you asked for "efficient"):

final case class Var[A](var value: A) { }
def multifold[A,B,C](xs: Traversable[A])(f: A => B)(zero: C)(g: (C,A) => C) = {
  import scala.collection.JavaConverters._
  val m = new java.util.HashMap[B, Var[C]]
  xs.foreach{ x =>
    val v = { 
      val fx = f(x)
      val op = m.get(fx)
      if (op != null) op
      else { val nv = Var(zero); m.put(fx, nv); nv }
    }
    v.value = g(v.value, x)
  }
  m.asScala.mapValues(_.value)
}

(Depending on your use case you may wish to pack into an immutable map instead in the last step.) Here's an example of it in action:

scala> multifold(List("salmon","herring","haddock"))(_(0))(0)(_ + _.length)
res1: scala.collection.mutable.HashMap[Char,Int] = Map(h -> 14, s -> 6)        

Now, you might notice something weird here: I'm using a Java HashMap. This is because Java's HashMaps are 2-3x faster than Scala's. (You can write the equivalent thing with a Scala HashMap, but it doesn't actually make things any faster than your original.) Consequently, this operation is 2-3x faster than what you posted. But unless you're under severe memory pressure, creating the transient collections doesn't actually hurt you much.

like image 28
Rex Kerr Avatar answered Nov 15 '22 03:11

Rex Kerr