The below code (annotated inline with locations) gives a minimal example of the puzzling behavior I'm experiencing.
Essentially, why does (2) result in terrible space/time performance while (1) does not?
The below code is compiled and run as follows on ghc version 8.4.3:
ghc -prof -fprof-auto -rtsopts test.hs; ./test +RTS -p
{-# LANGUAGE Rank2Types #-}
import Debug.Trace
-- Not sure how to get rid of the record
data State = State {
-- (0) If vstate :: Float, the extra "hello"s go away
vstate :: forall a . (Fractional a) => a
}
step :: State -> State
step s =
-- (1) one "hello" per step
-- let vs = trace "hello" (vstate s) in
-- s { vstate = vs `seq` vstate s }
-- (2) increasing "hello"s per step
s { vstate = (trace "hello" (vstate s)) `seq` vstate s }
main :: IO ()
main = do
let initState = State { vstate = 0 }
-- (3) step 3 times
-- let res = step $ step $ step initState
-- print $ vstate res
-- (4) step 20 times to profile time/space performance
let res = iterate step initState
print $ vstate $ last $ take 20 res
print "done"
a. With (1) and (3) commented in, compiled without -O2, the code only outputs "hello" three times, as I expect it to.
b. With (2) and (3) commented in, compiled without -O2, the code outputs "hello" eight times. It seems to output one additional "hello" per step. I don't understand why this is happening.
c. With (1) and (4) commented in, compiled without -O2, the code runs extremely fast.
d. With (2) and (4) commented in, compiled without -O2, the code runs very slowly, and the performance report (included below) shows that makes many more calls to vstate and uses much more memory than variant c. I also don't understand why this is happening.
e. With (2) and (4) commented in, compiled with -O2, the code behaves the same as variant c. So clearly ghc is able to optimize away whatever pathological behavior is happening in variant d.
Here is the profiling report for variant c (fast):
Mon Aug 13 15:48 2018 Time and Allocation Profiling Report (Final)
partial +RTS -p -RTS
total time = 0.00 secs (0 ticks @ 1000 us, 1 processor)
total alloc = 107,560 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
CAF GHC.IO.Handle.FD <entire-module> 0.0 32.3
CAF GHC.IO.Encoding <entire-module> 0.0 3.1
main Main partial.hs:(24,1)-(35,16) 0.0 13.4
main.res Main partial.hs:32:9-36 0.0 1.6
step Main partial.hs:(15,1)-(18,36) 0.0 1.1
step.vs Main partial.hs:17:9-37 0.0 46.1
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 114 0 0.0 0.6 0.0 100.0
CAF Main <entire-module> 227 0 0.0 0.1 0.0 52.2
main Main partial.hs:(24,1)-(35,16) 228 1 0.0 2.7 0.0 52.1
vstate Main partial.hs:11:5-10 230 20 0.0 0.0 0.0 0.0
main.initState Main partial.hs:25:9-40 239 0 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 234 0 0.0 0.0 0.0 0.0
step Main partial.hs:(15,1)-(18,36) 235 0 0.0 0.0 0.0 0.0
main.initState Main partial.hs:25:9-40 233 1 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 231 1 0.0 1.6 0.0 49.4
step Main partial.hs:(15,1)-(18,36) 232 19 0.0 1.1 0.0 47.8
step.vs Main partial.hs:17:9-37 236 19 0.0 46.1 0.0 46.7
vstate Main partial.hs:11:5-10 237 190 0.0 0.0 0.0 0.6
main.initState Main partial.hs:25:9-40 238 0 0.0 0.6 0.0 0.6
CAF Debug.Trace <entire-module> 217 0 0.0 0.2 0.0 0.2
CAF GHC.Conc.Signal <entire-module> 206 0 0.0 0.6 0.0 0.6
CAF GHC.IO.Encoding <entire-module> 189 0 0.0 3.1 0.0 3.1
CAF GHC.IO.Encoding.Iconv <entire-module> 187 0 0.0 0.2 0.0 0.2
CAF GHC.IO.Handle.FD <entire-module> 178 0 0.0 32.3 0.0 32.3
CAF GHC.IO.Handle.Text <entire-module> 176 0 0.0 0.1 0.0 0.1
main Main partial.hs:(24,1)-(35,16) 229 0 0.0 10.7 0.0 10.7
Here is the profiling report for variant d (slow; without -O2):
Mon Aug 13 15:25 2018 Time and Allocation Profiling Report (Final)
partial +RTS -p -RTS
total time = 1.48 secs (1480 ticks @ 1000 us, 1 processor)
total alloc = 1,384,174,472 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
step Main partial.hs:(15,1)-(21,60) 95.7 98.8
main.initState Main partial.hs:25:9-40 3.0 1.2
vstate Main partial.hs:11:5-10 1.4 0.0
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 114 0 0.0 0.0 100.0 100.0
CAF Main <entire-module> 227 0 0.0 0.0 100.0 100.0
main Main partial.hs:(24,1)-(35,16) 228 1 0.0 0.0 100.0 100.0
vstate Main partial.hs:11:5-10 230 1048575 1.4 0.0 100.0 100.0
main.initState Main partial.hs:25:9-40 236 0 3.0 1.2 3.0 1.2
main.res Main partial.hs:32:9-36 234 0 0.0 0.0 95.7 98.8
step Main partial.hs:(15,1)-(21,60) 235 0 95.7 98.8 95.7 98.8
main.initState Main partial.hs:25:9-40 233 1 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 231 1 0.0 0.0 0.0 0.0
step Main partial.hs:(15,1)-(21,60) 232 19 0.0 0.0 0.0 0.0
CAF Debug.Trace <entire-module> 217 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 206 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 189 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 187 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 178 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Text <entire-module> 176 0 0.0 0.0 0.0 0.0
main Main partial.hs:(24,1)-(35,16) 229 0 0.0 0.0 0.0 0.0
Here are some notes/guesses/questions on why this is happening:
vstate is monomorphized as vstate :: Float. But I don't see why the lack of a let-binding, as in location (2), would cause such bad time/space performance.realToFrac (the commit is here in case anyone is curious). I know that compiling with -O2 fixes the behavior in the minimal example, but I tried it in the larger codebase and it does not fix the performance issues. (The reason we need rank-2 polymorphism in the larger codebase is to use the ad library for autodiff.) Is there a more principled fix than using realToFrac, such as an inline specialization that can be applied?forall a . (Fractional a) => a is a function type.
It has two arguments, a type (a :: *) and an instance with type Fractional a. Whenever you see =>, it's a function operationally, and compiles to a function in GHC's core representation, and sometimes stays a function in machine code as well. The main difference between -> and => is that arguments for the latter cannot be explicitly given by programmers, and they are always filled in implicitly by instance resolution.
Let's see the fast step first:
step :: State -> State
step (State f) =
let vs = trace "hello" f
in State (vs `seq` f)
Here, vs has an undetermined Fractional type which defaults to Double. If you turn on -Wtype-defaults warning, GHC will point this out to you. Since vs :: Double, it's just a numeric value, which is captured by the returned closure. Right, vs `seq` f is a function, since it has a functional type forall a . (Fractional a) => a, and it's desugared to an actual lambda expression by GHC. This lambda abstracts over the two arguments, captures vs as free variable, and passes along both arguments to f.
So, each step creates a new function closure which captures a vs :: Double. If we call step three times, we get three closures with three Doubles inside, each closure referring to the previous one. Then, when we write vstate (step $ step $ step initState), we again default to Double, and GHC calls this closure with the Fractional Double instance. All the vs-es call previous closures with Fractional Double, but every vs is evaluated only once, because they are regular lazy Double values which are not recomputed.
However, if we enable NoMonomorphismRestriction, vs is generalized to forall a. Fractional a => a, so it becomes a function as well, and its calls are not memoized anymore. Hence, in this case the fast version behaves the same as the slow version.
Now, the slow step:
step :: State -> State
step (State f) = State ((trace "hello" f) `seq` f)
This has exponential number of calls in the number of steps, because step f calls f twice, and without optimizations no computation is shared, because both calls occur under a lambda. In (trace "hello" f) `seq` f, the first call to f defaults to Fractional Double, and the second call just passes along the implicit Fractional a instance as before.
If we switch on optimization, GHC observes that the first f call does not depend on the function parameters, and floats out trace "hello" f to a let-binding, resulting in pretty much the same code as in the fast version.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With