Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell - Optimizing differential equation solver

I'm learning Haskell and am trying to write code as fast as I can do in C. For this exercise, I'm writing a Euler integrator for a simple one-dimensional physical system.

  • The C code is compiled with GCC 4.5.4 and -O3. It runs in 1.166 seconds.
  • The Haskell code is compiled with GHC 7.4.1 and -O3. It runs in 21.3 seconds.
  • If I compile Haskell with -O3 -fllvm, it runs in 4.022 seconds.

So, am I missing something to optimize my Haskell code?

PS.: I used the following arguments: 1e-8 5.

C code:

#include <stdio.h>
double p, v, a, t;

double func(double t) {
  return t * t;
}

void euler(double dt) {
  double nt = t + dt;
  double na = func(nt);
  double nv = v + na * dt;
  double np = p + nv * dt;

  p = np;
  v = nv;
  a = na;
  t = nt;
}

int main(int argc, char ** argv) {
  double dt, limit;
  sscanf(argv[1], "%lf", &dt);
  sscanf(argv[2], "%lf", &limit);
  p = 0.0;
  v = 0.0;
  a = 0.0;
  t = 0.0;

  while(t < limit) euler(dt);
  printf("%f %f %f %f\n", p, v, a, t);
  return 0;
}

Haskell Code:

import System.Environment (getArgs)

data EulerState = EulerState !Double !Double !Double !Double deriving(Show)
type EulerFunction = Double -> Double

main = do
  [dt, l] <- fmap (map read) getArgs
  print $ runEuler (EulerState 0 0 0 0) (**2) dt l

runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState
runEuler s@(EulerState _ _ _ t) f dt limit = let s' = euler s f dt
                                             in case t `compare` limit of
                                                  LT -> s' `seq` runEuler s' f dt limit
                                                  _ -> s'

euler :: EulerState -> EulerFunction -> Double -> EulerState
euler (EulerState p v a t) f dt = (EulerState p' v' a' t')
    where t' = t + dt
          a' = f t'
          v' = v + a'*dt
          p' = p + v'*dt
like image 706
Charles Welton Avatar asked Oct 11 '12 02:10

Charles Welton


4 Answers

The key points have already been mentioned by hammar and by Philip JF. But let me collect them and add a little bit of explanation nevertheless.

I will proceed from top to bottom.

data EulerState = EulerState !Double !Double !Double !Double

Your type has strict fields, so whenever a object of that type is evaluated to WHNF, its fields are also evaluated to WHNF. In this case, that means the object is completely evaluated. That's good, but in our case not good enough, unfortunately. Objects of this type can still be constructed using pointers to the raw data instead of unpacking the raw data into the constructor, and that is what happens to the acceleration field (modulo the fact that the loop doesn't directly use the type, but passes the components as arguments). Since that field is not used in euler, you get

Rec {
Main.$wrunEuler [Occ=LoopBreaker]
  :: GHC.Prim.Double#
     -> GHC.Prim.Double#
     -> GHC.Types.Double
     -> GHC.Prim.Double#
     -> Main.EulerFunction
     -> GHC.Prim.Double#
     -> GHC.Prim.Double#
     -> (# GHC.Types.Double,
           GHC.Types.Double,
           GHC.Types.Double,
           GHC.Types.Double #)

a loop with a boxed argument there. That means that in each iteration, some Double#s need to be boxed, and some Doubles unboxed. Boxing and unboxing aren't very expensive operations,but in a loop that would otherwise be tight, they can cost a lot of performance. A further instance of the same boxing/unboxing problem is connected to the argument of type EulerFunction, more on that later. -funbox-strict-fields as suggested by Philp JF, or an {-# UNPACK #-} pragma on at least the acceleration field helps here, but the difference only becomes relevant when the boxing/unboxing for the function evaluation is eliminated too.

print $ runEuler (EulerState 0 0 0 0) (**2) dt l

You're passing (** 2) here as an argument. That's not the same function as you use in C, the corresponding C function would be return pow(t,2);, and with my gcc, using that nearly doubles the running time for the C programme (no difference for clang, though). The big performance problem with that is that (**) is a slow function. Since (** 2) has different results from \x -> x*x for many arguments, there is no rewrite rule for that, so you really get that slow function with GHC's native code generator (it seems that LLVM replaces it with \x -> x*x nevertheless from the huge performance difference of the two GHC backends and the clang result). If you pass (\x -> x*x) or (^ 2) there instead of (** 2), you get the multiplication (there is a rewrite rule for (^ 2) since 7.4). At this point, on my system, there is no huge difference between the performance of the NCG-generated code and the LLVM-generated, but the NCG is about 10% faster.

Now the huge problem

runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState
runEuler s@(EulerState _ _ _ t) f dt limit = let s' = euler s f dt
                                             in case t `compare` limit of
                                                  LT -> s' `seq` runEuler s' f dt limit
                                                  _ -> s'

runEuler is recursive, hence it cannot be inlined. That means the passed function cannot be inlined there either, and the dt and limit arguments are passed for each iteration too. That the function cannot be inlined means that in the loop, its argument must be boxed before being passed to the function, and its result must then be unboxed. That is expensive. And it means that no optimisations that could be made after inlining the function argument can ever be made.

If you make the worker/wrapper transformation and static argument transformation suggested by hammar, runEuler can be inlined, hence the passed function can be inlined and - in this case - the boxing of the argument, and unboxing of its result can be eliminated. Furthermore, and of even bigger impact, in this case the function call can be eliminated and replaced with one machine operation. That leads to a nice tight loop, as illustrated by

       174,208 bytes allocated in the heap
         3,800 bytes copied during GC

in comparison to the

16,000,174,912 bytes allocated in the heap
     1,475,432 bytes copied during GC

of the original.

Together, that achieves about half the speed of the C programme with the native code generator, and the same speed with the LLVM backend on my box (the native code generator is not particularly good at optimising loops, while LLVM is, since loops are very common in many languages compiled via LLVM).

like image 130
Daniel Fischer Avatar answered Nov 15 '22 03:11

Daniel Fischer


I got a nice boost by applying a worker-wrapper transformation to runEuler.

runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState
runEuler s f dt limit = go s
  where go s@(EulerState _ _ _ t) = if t < limit then go (euler s f dt) else s

This helps f get inlined into the loop (which probably also happens in the C version), getting rid of a lot of overhead.

like image 34
hammar Avatar answered Nov 15 '22 02:11

hammar


I don't have a functioning LLVM at the moment, but I get within a factor of two by

  1. Using -O2 instead of -O3 in GHC (although I doubt it matters, -O3 is not maintained)
  2. Using -funbox-strict-fields
  3. Using x*x instead of x ** 2 (same as your C code)
  4. Moving your "euler function" to be an independent function the same way it is in C.

ie

  func :: EulerFunction
  func x = x * x

  runEuler :: EulerState -> Double -> Double -> EulerState
  runEuler s@(EulerState _ _ _ t) dt limit = let s' = euler s dt
                                             in case t `compare` limit of
                                                  LT -> s' `seq` runEuler s' dt limit
                                                  _ -> s'

  euler :: EulerState -> Double -> EulerState
  euler (EulerState p v a t) dt = (EulerState p' v' a' t')
    where t' = t + dt
          a' = func t'
          v' = v + a'*dt
          p' = p + v'*dt

you can probably push it further (or perhaps a Haskell performance expert like Dons will show up with a solution), I have not even looked at the core this generates, but in general the way to make Haskell code "as fast as C" is "write it in C and use the FFI."

like image 5
Philip JF Avatar answered Nov 15 '22 02:11

Philip JF


Few references:

  • http://www.haskell.org/haskellwiki/Performance
  • http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html
  • http://donsbot.wordpress.com/2010/03/01/evolving-faster-haskell-programs-now-with-llvm/

Below is evangelism representing common folklore. So take it with a grain of salt.

You cannot get stable C-like performance across different microbenchmarks from anything less mature than prehistoric languages such as C, Fortran, Ada and C++. Even Java is not quite there yet. Sometimes you get, but sometimes compiler fails, and GHC fails quite often.

But microbenchmarks don't tell you everything.

The problem is that getting fine tuned low level C code everywhere is not financially feasible for large projects. So C programs end up having poor algorithms, poor architecture, no low-level bottlenecks and plans to eventually rewrite everything. In C it's easy to tune low level code but painfully hard to do large scale architectural changes. In Haskell it's vise versa, so writing in a mixture of haskell and C should give you best of both worlds.

like image 4
nponeccop Avatar answered Nov 15 '22 03:11

nponeccop