
Why does Haskell perform so poorly when executing C-like code? (in this instance at least)

I am trying to figure out some performance issues I have been having with Haskell. As part of that, I wrote a small benchmark comparing C and Haskell. Specifically, I translated the C program to Haskell with as few changes as I could, so the timed part of the Haskell program is written in a very imperative style.

The program makes two lists of random numbers in some range and then calculates the integral of the graph formed by simply connecting those points, with one list being the x-values and the other the y-values. Essentially, it is the trapezoidal rule.
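Concretely, the quantity being computed is

area = sum_{i=0..N-2} (y[i+1] + y[i]) / 2 * (x[i+1] - x[i])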

Here are the two programs:

main.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 5000000
#define maxY 1e5f/N
#define maxXgap 1

int main(){
    int i;
    float *y, *x;
    float xaccum, area;
    clock_t begin, end;
    double time_spent;

    y = (float*)malloc(sizeof(float)*N);
    x = (float*)malloc(sizeof(float)*N);

    srand(50546345); // change seed for different numbers

    //populate y and x fields with random points
    for(i = 0; i < N; i++){
        y[i] = ((float)rand())/((float)RAND_MAX)*maxY;
    }
    xaccum = 0;
    for(i = 0; i < N; i++){
        x[i] = xaccum;
        xaccum += ((float)rand())/((float)RAND_MAX)*maxXgap;
    }
    begin = clock();
    //perform a trapezoidal integration using the x y coordinates
    area = 0;
    for(i = 0; i < N-1; i++){
        area += (y[i+1]+y[i])/2*(x[i+1]-x[i]);
    }
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC * 1000;
    printf("%i points\n%f area\n%f ms\n", N, area, time_spent);
}

Main.hs

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Array.Unboxed
import Data.Array.IO
import Data.List
import System.Random
import System.CPUTime
import Text.Printf
import Control.Exception

main :: IO ()
main = do
          (x,y) <- initArrays
          area <- time $ integrate x y
          print area

n :: Int
n = 5000000

maxY :: Float
maxY = 100000.0/(fromIntegral n)

maxXgap :: Float
maxXgap = 1

--initialize arrays with random floats
--this part is not measured in the running time (very slow)
initArrays :: IO (IOUArray Int Float, IOUArray Int Float)
initArrays = do
                y <- newListArray (0,n-1) (randomList maxY n (mkStdGen 23432))
                x <- newListArray (0,n-1) (scanl1 (+) $ randomList maxXgap n (mkStdGen 5462))
                return (x,y)

randomList :: Float -> Int -> StdGen -> [Float]
randomList max n gen = map (abs . ((*) max)) (take n . unfoldr (Just . random) $ gen)

integrate :: IOUArray Int Float -> IOUArray Int Float -> IO Float
integrate x y = iterative x y 0 0

iterative :: IOUArray Int Float -> IOUArray Int Float -> Int -> Float -> IO Float
iterative x y !i !accum = do if i == n-1
                              then return accum
                              else do x1 <- readArray x i
                                      x2 <- readArray x (i+1)
                                      y1 <- readArray y i
                                      y2 <- readArray y (i+1)
                                      iterative x y (i+1) (accum + (y2+y1)/2*(x2-x1))

time :: IO t -> IO t
time a = do
            start <- getCPUTime
            v <- a
            end <- getCPUTime
            let diff = (fromIntegral (end-start)) / (10^9)
            printf "Computation time %0.5f ms\n" (diff :: Double)
            return v

The C integration runs in about 7 ms and the Haskell integration in about 60 ms on my system. Of course the Haskell version will be slower, but I am wondering why it is this much slower. Obviously there is a lot of inefficiency in the Haskell code.

Why is the Haskell code so much slower? How could one fix it?

Thanks for any answers.

asked Jul 18 '13 by Eric Thoma

2 Answers

Out of curiosity, I ran this with the LLVM backend:

ghc Test.hs -O2 -XBangPatterns -fllvm -optlo-O3

and it took it down from 60ms to 24ms. Still not ideal.

One of the first things I do when I want to know why a benchmark like this is slow is to dump the prepared Core, that is, the Core after optimisations:

ghc Test.hs -O2 -ddump-prep -dsuppress-all -XBangPatterns > Test.hscore

After looking through the Core, I eventually found $wa, where the iterative loop is defined. It turns out it is performing surprisingly many bounds checks on the array reads. I usually use Data.Vector.Unboxed instead, whose unsafeIndex (and unsafeRead in the mutable interface) skip those checks; they would be useful here. Personally, I think the vector package is superior.

If you look at $wa, you'll notice it's unboxing the arguments at the start:

case w_s3o9 of _ { STUArray l_s3of u_s3oi ds1_s3ol ds2_s3oH ->
case l_s3of of wild2_s3os { I# m_s3oo ->
case u_s3oi of wild3_s3ot { I# n1_s3ov ->
case ds1_s3ol of wild4_s3oC { I# y1_s3oE ->

This looks bad, but it turns out the recursive call uses a specialised version, integrate_$s$wa, with unboxed integers and so on. This is good.

In summary, I think you should get a good improvement by using vector with unsafe indexing.
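As a rough sketch of what I mean (my own names, and not necessarily the same as the gist linked below), the inner loop could read from immutable unboxed vectors with unsafeIndex:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

-- Same loop as 'iterative', but over immutable unboxed vectors,
-- using unsafeIndex so no bounds check is performed per read.
integrateV :: U.Vector Float -> U.Vector Float -> Float
integrateV xs ys = go 0 0
  where
    len = U.length xs
    go !i !acc
      | i >= len - 1 = acc
      | otherwise    =
          let x1 = U.unsafeIndex xs i
              x2 = U.unsafeIndex xs (i+1)
              y1 = U.unsafeIndex ys i
              y2 = U.unsafeIndex ys (i+1)
          in  go (i+1) (acc + (y2 + y1) / 2 * (x2 - x1))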

Edit: here is a modified version with Data.Vector. It runs in about 7ms. For good vector code, I think the only slowness compared to C should be due to incomplete alias analysis. https://gist.github.com/amosr/6026995

answered by Amos Robinson


First, I tried your code to reproduce your findings (using GHC 7.6.3 with -O2 -fllvm, and gcc 4.7.2 with -O3):

$ ./theHaskellVersion-rev1
Computation time 24.00000 ms
25008.195
[tommd@Vodka Test]$ ./theCVersion
5000000 points
25013.105469 area
10.000000 ms

So we are aiming for 10 ms if the goal is to perform on par (about a 60% reduction in runtime). Looking at your code, I see:

  • Arrays are used, which are ancient and kludgy. I switched to Vector.
  • There is no worker/wrapper transformation on iterative. The change is just to make an auxiliary function in a where clause that doesn't take x and y as parameters (a sketch of what I mean follows below).
  • Float is used even though Double often performs surprisingly better (this probably doesn't matter here).
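To illustrate the worker/wrapper point, here is roughly what I mean, still using your IOUArray version (a sketch; it relies on the n from your question and the names are mine):

{-# LANGUAGE BangPatterns #-}
import Data.Array.IO (IOUArray, readArray)

-- 'integrate' is the wrapper; the local worker 'go' closes over x and y,
-- so they are not passed along on every recursive call.
integrate :: IOUArray Int Float -> IOUArray Int Float -> IO Float
integrate x y = go 0 0
  where
    go !i !accum
      | i == n - 1 = return accum
      | otherwise  = do
          x1 <- readArray x i
          x2 <- readArray x (i+1)
          y1 <- readArray y i
          y2 <- readArray y (i+1)
          go (i+1) (accum + (y2 + y1) / 2 * (x2 - x1))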

The end result is on par with what you posted for C:

$ ghc -O2 so.hs -hide-package random && ./so
Computation time 11.00000 ms
24999.048783785303
answered by Thomas M. DuBuisson