scala implicit performance

Question

This comes up regularly. Functions coded up using generics are signifficnatly slower in scala. See example below. Type specific version performs about a 1/3 faster than the generic version. This is doubly surprising given that the generic component is outside of the expensive loop. Is there a known explanation for this?

  def xxxx_flttn[T](v: Array[Array[T]])(implicit m: Manifest[T]): Array[T] = {
    val I = v.length
    if (I <= 0) Array.ofDim[T](0)
    else {
      val J = v(0).length
      for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
      val flt = Array.ofDim[T](I * J)
      for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
      flt
    }
  }
  def flttn(v: Array[Array[Double]]): Array[Double] = {
    val I = v.length
    if (I <= 0) Array.ofDim[Double](0)
    else {
      val J = v(0).length
      for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
      val flt = Array.ofDim[Double](I * J)
      for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
      flt
    }
  }

Rex Kerr · Accepted Answer

You can't really tell what you're measuring here--not very well, anyway--because the for loop isn't as fast as a pure while loop, and the inner operation is quite inexpensive. If we rewrite the code with while loops--the key double-iteration being

 var i = 0
  while (i<I) {
    var j = 0
    while (j<J) {
      flt(i * J + j) = v(i)(j)
      j += 1
    }
    i += 1
  }
  flt

then we see that the bytecode for the generic case is actually dramatically different. Non-generic:

133:    checkcast   #174; //class "[D"
136:    astore  6
138:    iconst_0
139:    istore  5
141:    iload   5
143:    iload_2
144:    if_icmpge   191
147:    iconst_0
148:    istore  4
150:    iload   4
152:    iload_3
153:    if_icmpge   182
// The stuff above implements the loop; now we do the real work
156:    aload   6
158:    iload   5
160:    iload_3
161:    imul
162:    iload   4
164:    iadd
165:    aload_1
166:    iload   5
168:    aaload             // v(i)
169:    iload   4
171:    daload             // v(i)(j)
172:    dastore            // flt(.) = _
173:    iload   4
175:    iconst_1
176:    iadd
177:    istore  4
// Okay, done with the inner work, time to jump around
179:    goto    150
182:    iload   5
184:    iconst_1
185:    iadd
186:    istore  5
188:    goto    141

It's just a bunch of jumps and low-level operations (daload and dastore being the key ones that load and store a double from an array). If we look at the key inner part of the generic bytecode, it instead looks like

160:    getstatic   #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
163:    aload   7
165:    iload   6
167:    iload   4
169:    imul
170:    iload   5
172:    iadd
173:    getstatic   #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
176:    aload_1
177:    iload   6
179:    aaload
180:    iload   5
182:    invokevirtual   #107; //Method scala/runtime/ScalaRunTime$.array_apply:(Ljava/lang/Object;I)Ljava/lang/Object;
185:    invokevirtual   #111; //Method scala/runtime/ScalaRunTime$.array_update:(Ljava/lang/Object;ILjava/lang/Object;)V
188:    iload   5
190:    iconst_1
191:    iadd
192:    istore  5

which, as you can see, has to call methods to do the array apply and update. The bytecode for that is a huge mess of stuff like

2:   aload_3 
3:   instanceof      #98; //class "[Ljava/lang/Object;"
6:   ifeq    18
9:   aload_3   
10:  checkcast       #98; //class "[Ljava/lang/Object;"
13:  iload_2
14:  aaload 
15:  goto    183
18:  aload_3
19:  instanceof      #100; //class "[I"
22:  ifeq    37
25:  aload_3   
26:  checkcast       #100; //class "[I"
29:  iload_2
30:  iaload 
31:  invokestatic    #106; //Method scala/runtime/BoxesRunTime.boxToInteger:
34:  goto    183
37:  aload_3
38:  instanceof      #108; //class "[D"
41:  ifeq    56
44:  aload_3   
45:  checkcast       #108; //class "[D"
48:  iload_2
49:  daload 
50:  invokestatic    #112; //Method scala/runtime/BoxesRunTime.boxToDouble:(
53:  goto    183

which basically has to test each type of array and box it if it's the type you're looking for. Double is pretty near the front (3rd of 10), but it's still a pretty major overhead, even if the JVM can recognize that the code ends up being box/unbox and therefore doesn't actually need to allocate memory. (I'm not sure it can do that, but even if it could it wouldn't solve the problem.)

So, what to do? You can try [@specialized T], which will expand your code tenfold for you, as if you wrote each primitive array operation by yourself. Specialization is buggy in 2.9 (should be less so in 2.10), though, so it may not work the way you hope. If speed is of the essence--well, first, write while loops instead of for loops (or at least compile with -optimise which helps for loops out by a factor of two or so!), and then consider either specialization or writing the code by hand for the types you require.

scala implicit performance

Tags:

performance

scala

kardenal.mendoza

1 Answers

Rex Kerr

Recent Activity

Donate For Us

scala implicit performance

Tags:

performance

scala

kardenal.mendoza

1 Answers

Rex Kerr

Related questions

Recent Activity

Donate For Us