I am new to Haskell and I am puzzled by the cost of a function call, which seems to be completely unreasonable to me, and makes me think I am doing something fundamentally wrong. Consider the following Haskell code: <pre class="prettyprint"><code>module Main where logistic x = 4.0*x*(1.0-x) lg :: Double -> Int -> Double lg !x 0 = x lg !x !n = lg (logistic x) (n-1) main = putStrLn $ show $ lg 0.7861 100000000 </code></pre> Compiling this with the command <pre class="prettyprint"><code> ghc -O3 -XBangPatterns -o tsths tst.hs </code></pre> and running it, I get: <pre class="prettyprint"><code>real 0m15.904s user 0m15.853s sys 0m0.016s </code></pre> If instead of calling the function <code>logistic</code> I calculate the expression inline: <pre class="prettyprint"><code>module Main where lg :: Double -> Int -> Double lg !x 0 = x lg !x !n = lg (4.0*x*(1.0-x)) (n-1) main = putStrLn $ show $ lg 0.7861 100000000 </code></pre> the execution time becomes: <pre class="prettyprint"><code>real 0m0.838s user 0m0.828s sys 0m0.004s </code></pre> which is exactly the same as the equivalent C program, which is <pre class="prettyprint"><code>#include <stdio.h> int main() { int i, num=100000000; double x=0.7861; for (i=0; i<num; ++i) x *= 4.0*(1.0-x); printf("%lg\n", x); } </code></pre> Am I doing something terribly wrong? Many thanks.

It's a bug in GHC-7.4.1. Looking at the generated core (only the core for the function <code>lg</code> is important, from GHC-7.4.2, we get <pre class="prettyprint"><code>Main.lg3 :: GHC.Types.Double [GblId, Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False, ConLike=False, Cheap=False, Expandable=False, Guidance=IF_ARGS [] 30 0}] Main.lg3 = GHC.Float.$w$cfromRational Main.lg4 Main.lg2 Main.lg1 :: GHC.Types.Double [GblId, Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False, ConLike=False, Cheap=False, Expandable=False, Guidance=IF_ARGS [] 30 0}] Main.lg1 = GHC.Float.$w$cfromRational Main.lg2 Main.lg2 Main.$wlg :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double# [GblId, Arity=2, Str=DmdType LL, Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=2, Value=True, ConLike=True, Cheap=True, Expandable=True, Guidance=IF_ARGS [0 30] 158 0}] Main.$wlg = \ (ww_s1Oy :: GHC.Prim.Double#) (ww1_s1OC :: GHC.Prim.Int#) -> case ww1_s1OC of ds_Xvs { __DEFAULT -> case Main.lg3 of _ { GHC.Types.D# x_awJ -> case Main.lg1 of _ { GHC.Types.D# x1_awV -> letrec { $wlg1_X1PF [Occ=LoopBreaker] :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double# [LclId, Arity=2, Str=DmdType LL] $wlg1_X1PF = \ (ww2_X1Pv :: GHC.Prim.Double#) (ww3_X1PA :: GHC.Prim.Int#) -> case ww3_X1PA of ds1_Xwr { __DEFAULT -> $wlg1_X1PF (GHC.Prim.*## (GHC.Prim.*## x_awJ ww2_X1Pv) (GHC.Prim.-## x1_awV ww2_X1Pv)) (GHC.Prim.-# ds1_Xwr 1); 0 -> ww2_X1Pv }; } in $wlg1_X1PF (GHC.Prim.*## (GHC.Prim.*## x_awJ ww_s1Oy) (GHC.Prim.-## x1_awV ww_s1Oy)) (GHC.Prim.-# ds_Xvs 1) } }; 0 -> ww_s1Oy } </code></pre> two top-level <code>Double</code>s and a decent loop. GHC-7.4.1 was a bit too inlining-happy, that produced <pre class="prettyprint"><code>Rec { Main.$wlg [Occ=LoopBreaker] :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double# [GblId, Arity=2, Str=DmdType LL] Main.$wlg = \ (ww_s1NS :: GHC.Prim.Double#) (ww1_s1NW :: GHC.Prim.Int#) -> case ww1_s1NW of ds_Xvb { __DEFAULT -> case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic4 Main.logistic2 of ww2_a1Mt { __DEFAULT -> case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic2 Main.logistic2 of ww3_X1Nq { __DEFAULT -> Main.$wlg (GHC.Prim.*## (GHC.Prim.*## ww2_a1Mt ww_s1NS) (GHC.Prim.-## ww3_X1Nq ww_s1NS)) (GHC.Prim.-# ds_Xvb 1) } }; 0 -> ww_s1NS } end Rec } </code></pre> and gave you two calls to the <code>fromRational</code> worker in each iteration. Now, <code>fromRational</code> is a fairly complicated function. It is still rather slow, despite having gotten a much faster implementation in the 7.2 series, so these calls hurt big time. With a type signature, there are no <code>Rational</code> top-level constants produced, only <code>Double</code> constants, and these are then used, which of course doesn't include a gratuitous slowdown.

As suggested by Dan Burton it is actually overhead of polymorphic function, because GHC infers type <code>logistic :: Fractional a => a -> a</code>. If you specify type explicitly you generally enable both better checking and better optimizations. I believe it is good practice to specify type of function explicitly. If you want to have function with polymorphic type but have full speed of monomorphic call in case of specific usage you can use <code>SPECIALIZE</code> pragma, but I believe this is GHC specific. <pre class="prettyprint"><code>{-# LANGUAGE BangPatterns #-} module Main where logistic :: Fractional a => a -> a {-# SPECIALISE logistic :: Double -> Double #-} logistic x = 4.0*x*(1.0-x) lg :: Double -> Int -> Double lg !x 0 = x lg !x !n = lg (logistic x) (n-1) main = putStrLn $ show $ lg 0.7861 100000000 </code></pre> Also note that you can specify <code>LANGUAGE</code> pragma at the beginning of file to enable bang patterns and don't need to enable them on command line. Times on my machine were 21 seconds for original, 0.67 sec for explicit type, 0.7 sec for specialize (which is basically the same). I believe overhead of specialized call is very small because it is just bunch of instructions which gets inlined anyway but polymorphic function results in call. Though it is weird that GHC cant inline despite polymorphism.

How much overhead do function calls have in Haskell?

Tags:

performance

haskell

I am new to Haskell and I am puzzled by the cost of a function call, which seems to be completely unreasonable to me, and makes me think I am doing something fundamentally wrong.

Consider the following Haskell code:

module Main where
logistic x = 4.0*x*(1.0-x)

lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)

main = putStrLn $ show $ lg 0.7861 100000000

Compiling this with the command

 ghc -O3 -XBangPatterns -o tsths tst.hs

and running it, I get:

real    0m15.904s
user    0m15.853s
sys     0m0.016s

If instead of calling the function logistic I calculate the expression inline:

module Main where

lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (4.0*x*(1.0-x)) (n-1)

main = putStrLn $ show $ lg 0.7861 100000000

the execution time becomes:

real    0m0.838s
user    0m0.828s
sys     0m0.004s

which is exactly the same as the equivalent C program, which is

#include <stdio.h>

int main() {
   int i, num=100000000;
   double x=0.7861;
   for (i=0; i<num; ++i)
      x *= 4.0*(1.0-x);
   printf("%lg\n", x);
}

Am I doing something terribly wrong?

Many thanks.

895

asked Sep 07 '12 15:09

user1654556

3 Answers

It's a bug in GHC-7.4.1. Looking at the generated core (only the core for the function lg is important, from GHC-7.4.2, we get

Main.lg3 :: GHC.Types.Double
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
         ConLike=False, Cheap=False, Expandable=False,
         Guidance=IF_ARGS [] 30 0}]
Main.lg3 = GHC.Float.$w$cfromRational Main.lg4 Main.lg2

Main.lg1 :: GHC.Types.Double
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
         ConLike=False, Cheap=False, Expandable=False,
         Guidance=IF_ARGS [] 30 0}]
Main.lg1 = GHC.Float.$w$cfromRational Main.lg2 Main.lg2

Main.$wlg :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[GblId,
 Arity=2,
 Str=DmdType LL,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=2, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=IF_ARGS [0 30] 158 0}]
Main.$wlg =
  \ (ww_s1Oy :: GHC.Prim.Double#) (ww1_s1OC :: GHC.Prim.Int#) ->
    case ww1_s1OC of ds_Xvs {
      __DEFAULT ->
        case Main.lg3 of _ { GHC.Types.D# x_awJ ->
        case Main.lg1 of _ { GHC.Types.D# x1_awV ->
        letrec {
          $wlg1_X1PF [Occ=LoopBreaker]
            :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
          [LclId, Arity=2, Str=DmdType LL]
          $wlg1_X1PF =
            \ (ww2_X1Pv :: GHC.Prim.Double#) (ww3_X1PA :: GHC.Prim.Int#) ->
              case ww3_X1PA of ds1_Xwr {
                __DEFAULT ->
                  $wlg1_X1PF
                    (GHC.Prim.*##
                       (GHC.Prim.*## x_awJ ww2_X1Pv) (GHC.Prim.-## x1_awV ww2_X1Pv))
                    (GHC.Prim.-# ds1_Xwr 1);
                0 -> ww2_X1Pv
              }; } in
        $wlg1_X1PF
          (GHC.Prim.*##
             (GHC.Prim.*## x_awJ ww_s1Oy) (GHC.Prim.-## x1_awV ww_s1Oy))
          (GHC.Prim.-# ds_Xvs 1)
        }
        };
      0 -> ww_s1Oy
    }

two top-level Doubles and a decent loop.

GHC-7.4.1 was a bit too inlining-happy, that produced

Rec {
Main.$wlg [Occ=LoopBreaker]
  :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[GblId, Arity=2, Str=DmdType LL]
Main.$wlg =
  \ (ww_s1NS :: GHC.Prim.Double#) (ww1_s1NW :: GHC.Prim.Int#) ->
    case ww1_s1NW of ds_Xvb {
      __DEFAULT ->
        case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic4 Main.logistic2
        of ww2_a1Mt { __DEFAULT ->
        case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic2 Main.logistic2
        of ww3_X1Nq { __DEFAULT ->
        Main.$wlg
          (GHC.Prim.*##
             (GHC.Prim.*## ww2_a1Mt ww_s1NS) (GHC.Prim.-## ww3_X1Nq ww_s1NS))
          (GHC.Prim.-# ds_Xvb 1)
        }
        };
      0 -> ww_s1NS
    }
end Rec }

and gave you two calls to the fromRational worker in each iteration.

Now, fromRational is a fairly complicated function. It is still rather slow, despite having gotten a much faster implementation in the 7.2 series, so these calls hurt big time.

With a type signature, there are no Rational top-level constants produced, only Double constants, and these are then used, which of course doesn't include a gratuitous slowdown.

117

answered Oct 11 '22 21:10

Daniel Fischer

As suggested by Dan Burton it is actually overhead of polymorphic function, because GHC infers type logistic :: Fractional a => a -> a. If you specify type explicitly you generally enable both better checking and better optimizations. I believe it is good practice to specify type of function explicitly.

If you want to have function with polymorphic type but have full speed of monomorphic call in case of specific usage you can use SPECIALIZE pragma, but I believe this is GHC specific.

{-# LANGUAGE BangPatterns #-}
module Main where
logistic :: Fractional a => a -> a
{-# SPECIALISE logistic :: Double -> Double #-}
logistic x = 4.0*x*(1.0-x)

lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)

main = putStrLn $ show $ lg 0.7861 100000000

Also note that you can specify LANGUAGE pragma at the beginning of file to enable bang patterns and don't need to enable them on command line.

Times on my machine were 21 seconds for original, 0.67 sec for explicit type, 0.7 sec for specialize (which is basically the same).

I believe overhead of specialized call is very small because it is just bunch of instructions which gets inlined anyway but polymorphic function results in call. Though it is weird that GHC cant inline despite polymorphism.

answered Oct 11 '22 20:10

Vladimir Still

Add a type signature to logistic and you will see the speedup. Allow me to use CPP to demonstrate the difference.

bash> cat tst.hs
module Main where

#if defined(SIG)
logistic :: Double -> Double
#endif
logistic x = 4.0*x*(1.0-x)

lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)

main = putStrLn $ show $ lg 0.7861 100000000


bash> ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.4.1

If compiled without SIG defined (the type signature is excluded):

bash> ghc -O3 -XBangPatterns -XCPP -o tsths tst.hs
[1 of 1] Compiling Main             ( tst.hs, tst.o )
Linking tsths ...


bash> time ./tsths
0.34209286442469333

real    0m13.187s
user    0m13.177s
sys 0m0.008s

Now lets compile with SIG defined so the signature is included:

bash> rm tsths *.o *.hi


bash> ghc -O3 -XBangPatterns -XCPP -DSIG -o tsths tst.hs
[1 of 1] Compiling Main             ( tst.hs, tst.o )
Linking tsths ...
bash> time ./tsths
0.34209286442469333

real    0m0.464s
user    0m0.440s
sys 0m0.020s

Not sure why GHC doesn't optimize it without the signature; the monomorphism restriction should restrict it to Double -> Double anyways.

answered Oct 11 '22 19:10

Dan Burton

Related questions
                            
                                "Fastest" hash function implemented in Java, comparing part of file
                            
                                Why is printf in F# so slow?
                            
                                Performance Implications of Point-Free style
                            
                                How do I read and parse a text file with numbers, fast (in C)?
                            
                                SQL Server tempdb optimization tips for a new server?
                            
                                Class.forName() caching
                            
                                How to track IIS server performance
                            
                                Are Symfony and CakePHP too slow to be usable?
                            
                                Fastest way to sort in Python
                            
                                How to copy TypedArray into another TypedArray?
                            
                                Should I use one big SQL Select statement or several small ones?
                            
                                Performance question: String.split and then walk on the array, or RegExp?
                            
                                MATLAB parfor is slower than for -- what is wrong?
                            
                                Why is std::accumulate so slow? [duplicate]
                            
                                Slow Swift String Performance
                            
                                What setup does REP do?
                            
                                How to open deep link from mobile browser?
                            
                                Fastest way to zero out a 2D array in C#
                            
                                Pre-allocating a list of None
                            
                                Java Performance measurement

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With