Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to fold list in scala, while avoiding allocations and vars

I have a bunch of items in a list, and I need to analyze the content to find out how many of them are "complete". I started out with partition, but then realized that I didn't need to two lists back, so I switched to a fold:

val counts = groupRows.foldLeft( (0,0) )( (pair, row) => 
     if(row.time == 0) (pair._1+1,pair._2) 
     else (pair._1, pair._2+1)
   )

but I have a lot of rows to go through for a lot of parallel users, and it is causing a lot of GC activity (assumption on my part...the GC could be from other things, but I suspect this since I understand it will allocate a new tuple on every item folded).

for the time being, I've rewritten this as

var complete = 0
var incomplete = 0
list.foreach(row => if(row.time != 0) complete += 1 else incomplete += 1)

which fixes the GC, but introduces vars.

I was wondering if there was a way of doing this without using vars while also not abusing the GC?

EDIT:

Hard call on the answers I've received. A var implementation seems to be considerably faster on large lists (like by 40%) than even a tail-recursive optimized version that is more functional but should be equivalent.

The first answer from dhg seems to be on-par with the performance of the tail-recursive one, implying that the size pass is super-efficient...in fact, when optimized it runs very slightly faster than the tail-recursive one on my hardware.

like image 271
Tony K. Avatar asked Dec 01 '22 04:12

Tony K.


2 Answers

The cleanest two-pass solution is probably to just use the built-in count method:

val complete = groupRows.count(_.time == 0)
val counts = (complete, groupRows.size - complete)

But you can do it in one pass if you use partition on an iterator:

val (complete, incomplete) = groupRows.iterator.partition(_.time == 0)
val counts = (complete.size, incomplete.size)

This works because the new returned iterators are linked behind the scenes and calling next on one will cause it to move the original iterator forward until it finds a matching element, but it remembers the non-matching elements for the other iterator so that they don't need to be recomputed.


Example of the one-pass solution:

scala> val groupRows = List(Row(0), Row(1), Row(1), Row(0), Row(0)).view.map{x => println(x); x}
scala> val (complete, incomplete) = groupRows.iterator.partition(_.time == 0)
Row(0)
Row(1)
complete: Iterator[Row] = non-empty iterator
incomplete: Iterator[Row] = non-empty iterator
scala> val counts = (complete.size, incomplete.size)
Row(1)
Row(0)
Row(0)
counts: (Int, Int) = (3,2)
like image 89
dhg Avatar answered Dec 05 '22 04:12

dhg


I see you've already accepted an answer, but you rightly mention that that solution will traverse the list twice. The way to do it efficiently is with recursion.

def counts(xs: List[...], complete: Int = 0, incomplete: Int = 0): (Int,Int) = 
  xs match {
    case Nil => (complete, incomplete)
    case row :: tail => 
      if (row.time == 0) counts(tail, complete + 1, incomplete)
      else               counts(tail, complete, incomplete + 1)
  }

This is effectively just a customized fold, except we use 2 accumulators which are just Ints (primitives) instead of tuples (reference types). It should also be just as efficient a while-loop with vars - in fact, the bytecode should be identical.

like image 41
Luigi Plinge Avatar answered Dec 05 '22 05:12

Luigi Plinge