Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can Haskell optimize function calls the same way Clang / GCC does?

I want to ask you if Haskell and C++ compilers can optimize function calls the same way. Please look at following codes. In the following example Haskell is significantly faster than C++.

I have heard that Haskell can compile to LLVM and can be optimized by the LLVM passes. Additionally I have heard that Haskell has some heavy optimizations under the hood. But the following examples should be able to work with the same performance. I want to ask:

  1. Why my sample benchmark in C++ is slower than the on in Haskell?
  2. is it possible to further optimize the codes?

(I'm using LLVM-3.2 and GHC-7.6).

C++ code:

#include <cstdio>
#include <cstdlib>

int b(const int x){
    return x+5;
}

int c(const int x){
    return b(x)+1;
}

int d(const int x){
    return b(x)-1;
}

int a(const int x){
    return c(x) + d(x);
}

int main(int argc, char* argv[]){
    printf("Starting...\n");
    long int iternum = atol(argv[1]);
    long long int out = 0;
    for(long int i=1; i<=iternum;i++){
        out += a(iternum-i);
    }
    printf("%lld\n",out);
    printf("Done.\n");
}

compiled with clang++ -O3 main.cpp

haskell code:

module Main where
import qualified Data.Vector as V
import System.Environment
b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1
a x = c x + d x
main = do
   putStrLn "Starting..."
   args <- getArgs
   let iternum = read (head args) :: Int in do
      putStrLn $ show $ V.foldl' (+) 0 $ V.map (\i -> a (iternum-i))
         $ V.enumFromTo 1 iternum
      putStrLn "Done."

compiled with ghc -O3 --make -fforce-recomp -fllvm ghc-test.hs

speed results:

enter image description here

Running testcase for program 'cpp/a.out'
-------------------
cpp/a.out 100000000      0.0%   avg time: 105.05 ms 
cpp/a.out 200000000      11.11% avg time: 207.49 ms 
cpp/a.out 300000000      22.22% avg time: 309.22 ms 
cpp/a.out 400000000      33.33% avg time: 411.7 ms 
cpp/a.out 500000000      44.44% avg time: 514.07 ms 
cpp/a.out 600000000      55.56% avg time: 616.7 ms 
cpp/a.out 700000000      66.67% avg time: 718.69 ms
cpp/a.out 800000000      77.78% avg time: 821.32 ms 
cpp/a.out 900000000      88.89% avg time: 923.18 ms 
cpp/a.out 1000000000     100.0% avg time: 1025.43 ms


Running testcase for program 'hs/main'
-------------------
hs/main 100000000    0.0%   avg time: 70.97 ms (diff: 34.08)
hs/main 200000000    11.11% avg time: 138.95 ms (diff: 68.54)
hs/main 300000000    22.22% avg time: 206.3 ms (diff: 102.92)
hs/main 400000000    33.33% avg time: 274.31 ms (diff: 137.39)
hs/main 500000000    44.44% avg time: 342.34 ms (diff: 171.73)
hs/main 600000000    55.56% avg time: 410.65 ms (diff: 206.05)
hs/main 700000000    66.67% avg time: 478.25 ms (diff: 240.44)
hs/main 800000000    77.78% avg time: 546.39 ms (diff: 274.93)
hs/main 900000000    88.89% avg time: 614.12 ms (diff: 309.06)
hs/main 1000000000   100.0% avg time: 682.32 ms (diff: 343.11)

EDIT Of course we cannot compare speed of languages, but the speed of implementiations.

But I'm curious if Ghc and C++ compilers can optimize function calls the same way

I've edited the question with new benchmark and codes based on your help :)

like image 463
Wojciech Danilo Avatar asked Jun 27 '13 13:06

Wojciech Danilo


2 Answers

If your goal is to get this running as quickly as your C++ compiler, then you would want to use a data structure that the compiler can have its way with.

module Main where

import qualified Data.Vector as V

b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1

a x = c x + d x

main = do
    putStrLn "Starting..."
    putStrLn $ show $ V.foldl' (+) 0 $ V.map a $ V.enumFromTo 1 100000000
    putStrLn "Done."

GHC is able to completely eliminate the loop and just inserts a constant into the resulting assembly. On my computer, this now has a runtime of < 0.002s, when using the same optimization flags as you originally specified.

As a follow up based on the comments by @Yuras, the core produced by the vector based solution and the stream-fusion solution are functionally identical.

Vector

main_$s$wfoldlM'_loop [Occ=LoopBreaker]
  :: Int# -> Int# -> Int#

main_$s$wfoldlM'_loop =
  \ (sc_s2hW :: Int#) (sc1_s2hX :: Int#) ->
    case <=# sc1_s2hX 100000000 of _ {
      False -> sc_s2hW;
      True ->
        main_$s$wfoldlM'_loop
          (+#
             sc_s2hW
             (+#
                (+# (+# sc1_s2hX 5) 1)
                (-# (+# sc1_s2hX 5) 1)))
          (+# sc1_s2hX 1)
    }

stream-fusion

$wloop_foldl [Occ=LoopBreaker]
  :: Int# -> Int# -> Int#

$wloop_foldl =
  \ (ww_s1Rm :: Int#) (ww1_s1Rs :: Int#) ->
    case ># ww1_s1Rs 100000000 of _ {
      False ->
        $wloop_foldl
          (+#
             ww_s1Rm
             (+#
                (+# (+# ww1_s1Rs 5) 1)
                (-# (+# ww1_s1Rs 5) 1)))
          (+# ww1_s1Rs 1);
      True -> ww_s1Rm
    }

The only real difference is the choice of comparison operation for the termination condition. Both versions compile to tight tail recursive loops that can be easily optimized by LLVM.

like image 155
sabauma Avatar answered Oct 10 '22 16:10

sabauma


ghc doesn't fuse lists (avoiding success at all costs?)

Here is version that uses stream-fusion package:

module Main where

import Prelude hiding (map, foldl)
import Data.List.Stream
import Data.Stream (enumFromToInt, unstream)
import Text.Printf
import Control.Exception
import System.CPUTime

b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1

a x = c x + d x

main = do
    putStrLn "Starting..."
    putStrLn $ show $ foldl (+) 0 $ map (\z -> a z) $ unstream $ enumFromToInt 1 100000000 
    putStrLn "Done."

I don't have llvm installed to compare with your results, but it is 10x faster then your version (compiled without llvm).

I think vector fusion should perform even faster.

like image 25
Yuras Avatar answered Oct 10 '22 15:10

Yuras