Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell: unnecessary reevaluations of constant expressions

I am going to demonstrate the problem using the following example program

{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b 
         | otherwise = if a < b then a 
                       else let q = a / b 
                            in b * (q - fromIntegral (floor q :: Int))

standardMap :: Double -> Point -> Point
standardMap k (Point q p) = 
   Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn 
     . show 
     . (\(Point a b) -> a + b) 
     . head . drop 100000000 
     . iterate' (standardMap k) $ (Point 0.15 0.25)
    where k = (cos (pi/3)) - (sin (pi/3))

Here standardMap k is the parametrized function and k=(cos (pi/3))-(sin (pi/3)) is a parameter. If i compile this program with ghc -O3 -fllvm the execution time on my machine is approximately 42s, however, if I write k in the form 0.5 - (sin (pi/3)) the execution time equals 21s and if I write k = 0.5 - 0.5 * (sqrt 3) it will take only 12s.

The conclusion is that k is reevaluated on each call of standardMap k.

Why this is not optimized?

P.S. compiler ghc 7.6.3 on archlinux

EDIT

For those who are concerned with the weird properties of standardMap here is a simpler and more intuitive example, which exhibits the same problem

{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

rotate :: Double -> Point -> Point
rotate k (Point q p) = 
   Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn 
     . show 
     . (\(Point a b) -> a + b) 
     . head . drop 100000000 
     . iterate' (rotate k) $ (Point 0.15 0.25)
   where --k = (cos (pi/3)) - (sin (pi/3))
         k = 0.5 - 0.5 * (sqrt 3)

EDIT

Before I asked the question I have tried to make k strict, the same way Don suggested, but with ghc -O3 I didn't see a difference. The solution with strictness works if the program is compiled with ghc -O2. I missed that because I didn't try all possible combinations of flags with the all possible versions of the program.

So what is the difference between -O3 and -O2 that affects such cases?

Should I prefer -O2 in general?

EDIT

As observed by Mike Hartl and others, if rotate k is changed into rotate $ k or standardMap k into standardMap $ k, the performance is improved, though it is not the best possible (Don's solution). Why?

like image 556
Benjamin Batistic Avatar asked Sep 06 '13 10:09

Benjamin Batistic


2 Answers

As always, check the core.

With ghc -O2, k is inlined into the loop body, which is floated out as a top level function:

Main.main7 :: Main.Point -> Main.Point
Main.main7 =
  \ (ds_dAa :: Main.Point) ->
    case ds_dAa of _ { Main.Point q_alG p_alH ->
    case q_alG of _ { GHC.Types.D# x_s1bt ->
    case p_alH of _ { GHC.Types.D# y_s1bw ->
    case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
    of ww_s1bi { __DEFAULT ->
    case Main.$wfmod
           (GHC.Prim.+##
              y_s1bw
              (GHC.Prim.*##
                 (GHC.Prim.-##
                    (GHC.Prim.cosDouble# 1.0471975511965976)
                    (GHC.Prim.sinDouble# 1.0471975511965976))
                 (GHC.Prim.sinDouble# x_s1bt)))
           6.283185307179586
    of ww1_X1bZ { __DEFAULT ->
    Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)

Indicating that the sin and cos calls aren't evaluated at compile time. The result is that a bit more math is going to occur:

$ time ./A
3.1430515093368085
real    0m15.590s

If you make it strict, it is at least not recalculated each time:

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)

    where
      k :: Double
      !k = (cos (pi/3)) - (sin (pi/3))

Resulting in:

ipv_sEq =
                      GHC.Prim.-##
                        (GHC.Prim.cosDouble# 1.0471975511965976)
                        (GHC.Prim.sinDouble# 1.0471975511965976) } in

And a running time of:

$ time ./A
6.283185307179588
real    0m7.859s

Which I think is good enough for now. I'd also add unpack pragmas to the Point type.

If you want to reason about numeric performance under different code arrangements, you must inspect the Core.


Using your revised example. It suffers the same issue. k is inlined rotate. GHC thinks it is really cheap, when in this benchmark it is more expensive.

Naively, ghc-7.2.3 -O2

$ time ./A
0.1470480616244365

real    0m22.897s

And k is evaluated each time rotate is called.

Make k strict: one way to force it to be not shared.

$ time ./A
0.14704806100839019

real    0m2.360s

Using UNPACK pragmas on the Point constructor:

$ time ./A
0.14704806100839019

real    0m1.860s
like image 118
Don Stewart Avatar answered Nov 11 '22 08:11

Don Stewart


I don't think it is repeated evaluation.

First, I switched to "do" notation and used a "let" on the definition of "k" which I figured should help. No - still slow.

Then I added a trace call - just being evaluated once. Even checked that the fast variant was in fact producing a Double.

Then I printed out both variations. There is a small difference in the starting values.

Tweaking the value of the "slow" variant makes it the same speed. I've no idea what your algorithm is for - would it be very sensitive to starting values?

import Debug.Trace (trace)

...

main = do
    -- is -0.3660254037844386
    let k0 = (0.5 - 0.5 * (sqrt 3))::Double
    -- was -0.3660254037844385
    let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001;
    putStrLn (show k0)
    putStrLn (show k1)
    putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k1) $ (Point 0.15 0.25)

EDIT: this is the version with numeric literals. It's displaying runtimes of 23sec vs 7sec for me. I compiled two separate versions of the code to make sure I wasn't doing something stupid like not recompiling.

main = do
    -- -0.3660254037844386
    -- -0.3660254037844385
    let k2 = -0.3660254037844385
    putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k2) $ (Point 0.15 0.25)

EDIT2: I don't know how to get the opcodes from ghc, but comparing the hexdumps for the two .o files shows they differ by a single byte - presumably the literal. So it can't be the runtime.


EDIT3: Tried turning profiling on, and that's just puzzled me even more. unless I'm missing something the only difference is a small discrepancy in the number of calls to fmod (fmod.q to be precise).

The "5" profile is for the constant ending "5", same with "6".

        Fri Sep  6 12:37 2013 Time and Allocation Profiling Report  (Final)

           constant-timings-5 +RTS -p -RTS

        total time  =       38.34 secs   (38343 ticks @ 1000 us, 1 processor)
        total alloc = 12,000,105,184 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

standardMap Main     71.0    0.0
iterate'    Main     21.2   93.3
fmod        Main      6.3    6.7


                                                          individual     inherited
COST CENTRE     MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN            MAIN                     50           0    0.0    0.0   100.0  100.0                                  
 main           Main                    101           0    0.0    0.0     0.0    0.0                                  
 CAF:main1      Main                     98           0    0.0    0.0     0.0    0.0                                  
  main          Main                    100           1    0.0    0.0     0.0    0.0                                  
 CAF:main2      Main                     97           0    0.0    0.0     1.0    0.0                                  
  main          Main                    102           0    1.0    0.0     1.0    0.0                                  
   main.\       Main                    110           1    0.0    0.0     0.0    0.0                                  
 CAF:main3      Main                     96           0    0.0    0.0    99.0  100.0                                  
  main          Main                    103           0    0.0    0.0    99.0  100.0                                  
   iterate'     Main                    104   100000001   21.2   93.3    99.0  100.0                                  
    standardMap Main                    105   100000000   71.0    0.0    77.9    6.7                                  
     fmod       Main                    106   200000001    6.3    6.7     6.9    6.7                                                                                                
      fmod.q    Main                    109    49999750    0.6    0.0     0.6    0.0                                                                                                
 CAF:main_k     Main                     95           0    0.0    0.0     0.0    0.0                                                                                                
  main          Main                    107           0    0.0    0.0     0.0    0.0                                                                                                
   main.k2      Main                    108           1    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.IO.Handle.FD         93           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.Conc.Signal          90           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.Float                89           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.IO.Encoding          82           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.IO.Encoding.Iconv    66           0    0.0    0.0     0.0    0.0 


        Fri Sep  6 12:38 2013 Time and Allocation Profiling Report  (Final)

           constant-timings-6 +RTS -p -RTS

        total time  =       22.17 secs   (22167 ticks @ 1000 us, 1 processor)
        total alloc = 11,999,947,752 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

standardMap Main     48.4    0.0
iterate'    Main     38.2   93.3
fmod        Main     10.9    6.7
main        Main      1.4    0.0
fmod.q      Main      1.0    0.0


                                                          individual     inherited
COST CENTRE     MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN            MAIN                     50           0    0.0    0.0   100.0  100.0
 main           Main                    101           0    0.0    0.0     0.0    0.0
 CAF:main1      Main                     98           0    0.0    0.0     0.0    0.0
  main          Main                    100           1    0.0    0.0     0.0    0.0
 CAF:main2      Main                     97           0    0.0    0.0     1.4    0.0
  main          Main                    102           0    1.4    0.0     1.4    0.0
   main.\       Main                    110           1    0.0    0.0     0.0    0.0
 CAF:main3      Main                     96           0    0.0    0.0    98.6  100.0
  main          Main                    103           0    0.0    0.0    98.6  100.0
   iterate'     Main                    104   100000001   38.2   93.3    98.6  100.0
    standardMap Main                    105   100000000   48.4    0.0    60.4    6.7
     fmod       Main                    106   200000001   10.9    6.7    12.0    6.7
      fmod.q    Main                    109    49989901    1.0    0.0     1.0    0.0
 CAF:main_k     Main                     95           0    0.0    0.0     0.0    0.0
  main          Main                    107           0    0.0    0.0     0.0    0.0
   main.k2      Main                    108           1    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Handle.FD         93           0    0.0    0.0     0.0    0.0
 CAF            GHC.Conc.Signal          90           0    0.0    0.0     0.0    0.0
 CAF            GHC.Float                89           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding          82           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding.Iconv    66           0    0.0    0.0     0.0    0.0

EDIT4: Link below is to the two opcode dumps (thanks to @Tom Ellis). Although I can't read them, they seem to have the same "shape". Presumably the long random-char strings are internal identifiers. I've just recompiled both with -O2 -fforce-recomp and the time differences are real. https://gist.github.com/anonymous/6462797

like image 41
Richard Huxton Avatar answered Nov 11 '22 10:11

Richard Huxton