GHC 7.10 generates slower code than older versions

Question

I noticed that the latest version of GHC (7.10.3) produces significantly slower code than older versions. My current version:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3

I also have two older versions installed on my local machine.

My test code is taken from here (the collatz1.hs code):

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code

main = do
    [a0] <- getArgs
    let max_a0 = (read a0)::Word32
    print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

Compiling with GHC 7.4, 7.6 and 7.10 yields the following times:

$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.879s
user    0m1.876s
sys     0m0.000s

$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.901s
user    0m1.896s
sys     0m0.000s

$ ~/Tools/ghc/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m10.562s
user    0m10.528s
sys     0m0.036s

There is no doubt that the latest version of GHC produces worse code than the two older versions. I can't reproduce the efficiency reported in the blog, though, probably because I don't have LLVM and I don't have the exact version the author used. Still, I believe the conclusion is obvious.

My question is: in general, why could this happen? Somehow GHC has become worse than it used to be. And specifically, if I want to investigate, how should I get started?

Zeta · Accepted Answer

Here's a comparison of both profiles (diff Test-GHC-7-8-4.prof Test-GHC-7-10-3.prof):

1c1                               
<       Fri Mar 11 19:58 2016 Time and Allocation Profiling Report  (Final)
---                               
>       Fri Mar 11 19:59 2016 Time and Allocation Profiling Report  (Final)
5,6c5,6                               
<       total time  =        2.40 secs   (2400 ticks @ 1000 us, 1 processor)
<       total alloc = 256,066,744 bytes  (excludes profiling overheads)
---                               
>       total time  =       10.89 secs   (10895 ticks @ 1000 us, 1 processor)
>       total alloc = 15,713,590,808 bytes  (excludes profiling overheads)
10,13c10,12                               
< lenIterWhile.len Main     93.8   0.0                    
< collatzMax       Main      2.2   93.7
< collatzNext      Main      2.0    0.0
< lenIterWhile     Main      1.5    6.2
---                                
> collatzNext      Main     79.6   89.4
> lenIterWhile.len Main     18.9    8.8
> collatzMax       Main      0.8    1.5

There's something very strange going on. While in GHC 7.8.4 lenIterWhile.len was taking most of the time, collatzNext is now the culprit. Let's have a look at the dumped core:

-- GHC 7.8.4
Rec {
Main.$wlen [Occ=LoopBreaker]
  :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wlen =
  \ (ww_s4Mn :: GHC.Prim.Word#) (ww1_s4Mr :: GHC.Prim.Int#) ->
    case ww_s4Mn of wild_XQ {
      __DEFAULT ->
        case GHC.Prim.remWord# wild_XQ (__word 2) of _ [Occ=Dead] {
          __DEFAULT ->
            Main.$wlen
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_XQ))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s4Mr 1);
          __word 0 ->
            Main.$wlen
              (GHC.Prim.quotWord# wild_XQ (__word 2)) (GHC.Prim.+# ww1_s4Mr 1)
        };
      __word 1 -> ww1_s4Mr
    }
end Rec }

Seems more or less reasonable. Now about GHC 7.10.3:

Rec {
$wlen_r6Sy :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
$wlen_r6Sy =
  \ (ww_s60s :: GHC.Prim.Word#) (ww1_s60w :: GHC.Prim.Int#) ->
    case ww_s60s of wild_X1Z {
      __DEFAULT ->
        case even
               @ Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1Z)
        of _ [Occ=Dead] {
          False ->
            $wlen_r6Sy
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1Z))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s60w 1);
          True ->
            $wlen_r6Sy
              (GHC.Prim.quotWord# wild_X1Z (__word 2)) (GHC.Prim.+# ww1_s60w 1)
        };
      __word 1 -> ww1_s60w
    }
end Rec }

All right, it looks almost the same, except for the call to even. Let's replace even with one of the inlinable Integral operations, e.g. x `rem` 2 == 0:

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2

-- rest of code the same

Let's compile it again with profiling and check:

$ stack --resolver=ghc-7.10 ghc -- Test.hs -O2 -fforce-recomp -prof -fprof-auto -auto-all
$ ./Test +RTS -s -p -RTS 1000000
(329,837799)
     416,119,240 bytes allocated in the heap
          69,760 bytes copied during GC
          59,368 bytes maximum residency (2 sample(s))
          21,912 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       800 colls,     0 par    0.000s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0002s    0.0003s

  INIT    time    0.000s  (  0.019s elapsed)
  MUT     time    2.500s  (  2.546s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.500s  (  2.567s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    166,447,696 bytes per MUT second

  Productivity 100.0% of total user, 97.4% of total elapsed

$ cat Test.prof
        Fri Mar 11 20:22 2016 Time and Allocation Profiling Report  (Final)

           Test.exe +RTS -s -p -RTS 1000000

        total time  =        2.54 secs   (2535 ticks @ 1000 us, 1 processor)
        total alloc = 256,066,984 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

lenIterWhile.len Main     94.4    0.0
main             Main      1.9   93.7
collatzNext      Main      1.8    0.0
lenIterWhile     Main      1.3    6.2

                                                                   individual     inherited
COST CENTRE           MODULE                     no.     entries  %time %alloc   %time %alloc

MAIN                  MAIN                        44           0    0.0    0.0   100.0  100.0
 main                 Main                        89           0    1.9   93.7   100.0  100.0
  main.\              Main                        92     1000000    0.4    0.0    98.1    6.2
   collatzLen         Main                        93     1000000    0.2    0.0    97.8    6.2
    lenIterWhile      Main                        94     1000000    1.3    6.2    97.5    6.2
     lenIterWhile.len Main                        95    88826840   94.4    0.0    96.2    0.0
      collatzNext     Main                        96    87826840    1.8    0.0     1.8    0.0
  main.max_a0         Main                        90           1    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding.CodePage    73           0    0.0    0.0     0.0    0.0
 CAF                  System.Environment          64           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Handle.Text          62           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding             61           0    0.0    0.0     0.0    0.0

Seems like that fixed it. So the problem is that GHC 7.8 inlines even, while GHC 7.10 doesn't. This is due to the added {-# SPECIALISE even :: x -> Bool #-} rules for Int and Integer, which prevent inlining.

As the discussion on the issue documents, making even and odd {-# INLINEABLE ... #-} would resolve this. Note that the specialisation itself was added for performance reasons.
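Until a fix lands in base, one possible workaround is to keep the parity test local and ask GHC to inline it. The following is only a sketch under that assumption: evenW32 is a hypothetical helper name, not part of the original program or of base, and the default limit used when no argument is given is likewise an illustrative choice.

```haskell
import Data.Word (Word32)
import System.Environment (getArgs)

-- Hypothetical helper (not in the original post): a monomorphic parity
-- test with an INLINE pragma, so the call site does not go through the
-- Integral dictionary the way the call to `even` does in the 7.10.3 core.
evenW32 :: Word32 -> Bool
evenW32 n = n `rem` 2 == 0
{-# INLINE evenW32 #-}

collatzNext :: Word32 -> Word32
collatzNext a = (if evenW32 a then a else 3*a+1) `div` 2

collatzLen :: Word32 -> Int
collatzLen = lenIterWhile collatzNext (/= 1)

-- Counts how many times `next` is applied before `notDone` becomes False.
lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m

main :: IO ()
main = do
    args <- getArgs
    -- Assumption: fall back to a small limit when no argument is given.
    let max_a0 = case args of
                   [a0] -> read a0
                   _    -> 1000 :: Word32
    print $ maximum $ map (\a -> (collatzLen a, a)) [1..max_a0]
```

Whether the helper is actually inlined is best confirmed by dumping the core again and checking that the call to even is gone from the worker loop.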

Tags: haskell, ghc

Jason Hu
