I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;
That is, a bog-standard multiply-add of four floats, the kind of thing ripe for SIMD optimization.
The result is going to an OpenGL vertex buffer, so it has to get dumped into a flat C array eventually. For the same reason, the calculations should probably be done on C 'float' types.
I've looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I've come up with seems to hover around 2% of the performance (that is, 50x slower) compared to C from GCC with the right flags. Granted, I started with Haskell a couple weeks ago, so my experience is limited—which is why I'm coming to you guys. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?
First, the most recent Haskell solution (clocks about 12 seconds). I tried the bang-patterns stuff from this SO post, but it didn't make a difference AFAICT. Replacing 'multAdd' with '(\i v -> v * 4)' brought execution time down to 1.9 seconds, so the bitwise stuff (and consequent challenges to automatic optimization) doesn't seem to be too much at fault.
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits
repCount = 10000
arraySize = 20000
a = fromList $ [0.2::CFloat,  0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]
multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))
multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.imap multAdd src
main = do
    print $ Data.Vector.Storable.sum $ multList repCount $ 
        Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)
Here's what I have in C. The code here has a few #ifdefs which prevents it from being compiled straight-up; scroll down for the test driver.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;
void setv4(v4fs *v, float x, float y, float z, float w) {
    float *a = (float*) v;
    a[0] = x;
    a[1] = y;
    a[2] = z;
    a[3] = w;
}
float sumv4(v4fs *v) {
    float *a = (float*) v;
    return a[0] + a[1] + a[2] + a[3];
}
void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
    for (int j = 0; j < N; j++) {
        d[j] = s[j] * m + a;
    }
}
void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
             Vector4 a, Vector4 m) {
    for (int j = 0; j < (N*4); j+=4) {
        d[j+0] = s[j+0] * m.x + a.x;
        d[j+1] = s[j+1] * m.y + a.y;
        d[j+2] = s[j+2] * m.z + a.z;
        d[j+3] = s[j+3] * m.w + a.w;
    }
}
int main () {
    v4fs a, m;
    v4fs *s, *d;
    setv4(&a, 0.2, 0.1, 0.6, 1.0);
    setv4(&m, 0.99, 0.7, 0.8, 0.6);
    s = calloc(N, sizeof(v4fs));
    d = s;
    double start = clock();
    for (int i = 0; i < M; i++) {
#ifdef COPY
        d = malloc(N * sizeof(v4fs));
#endif
#ifdef VECTOR
        vecmult(s, d, a, m);
#else
        Vector4 aa = *(Vector4*)(&a);
        Vector4 mm = *(Vector4*)(&m);
        scamult((float*)s, (float*)d, aa, mm);
#endif
#ifdef COPY
        free(s);
        s = d;
#endif
    }
    double end = clock();
    float sum = 0;
    for (int j = 0; j < N; j++) {
        sum += sumv4(s+j);
    }
    printf("%-50s %2.5f %f\n\n", NAME,
            (end - start) / (double) CLOCKS_PER_SEC, sum);
}
This script will compile and run the tests with a number of gcc flag combinations. The best performance was had by cmath-64-native-O3-restrict-vector-nocopy on my system, taking 0.22 seconds.
import System.Process
import GHC.IOBase
cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
            [("32", "-m32"), ("64", "-m64")],
            [("generic", ""), ("native", "-march=native -msse4")],
            [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
            [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
                ("norestrict", "-DMAYBE_RESTRICT=")],
            [("vector", "-DVECTOR"), ("scalar", "")],
            [("copy", "-DCOPY"), ("nocopy", "")]
           ]
-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations. '
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x
prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
    (name ++ "-" ++ namea, cmd ++ " " ++ cmda)
runCCmd name compileCmd = do
    res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
    if res == ExitSuccess
        then do system ("./" ++ name)
                return ()
        else    putStrLn $ name ++ " did not compile"
main = do
    mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
Well, this is better. 3.5s instead of 14s.
{-# LANGUAGE BangPatterns #-}
{-
-- multiply-add of four floats,
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;
-}
import qualified Data.Vector.Storable as V
import Data.Vector.Storable (Vector)
import Data.Bits
repCount, arraySize :: Int
repCount = 10000
arraySize = 20000
a, m :: Vector Float
a = V.fromList [0.2,  0.1, 0.6, 1.0]
m = V.fromList [0.99, 0.7, 0.8, 0.6]
multAdd :: Int -> Float -> Float
multAdd i v = v * (m `V.unsafeIndex` (i .&. 3)) + (a `V.unsafeIndex` (i .&. 3))
go :: Int -> Vector Float -> Vector Float
go n s
    | n <= 0    = s
    | otherwise = go (n-1) (f s)
  where
    f = V.imap multAdd
main = print . V.sum $ go repCount v
  where
    v :: Vector Float
    v = V.replicate (arraySize * 4) 0
            -- ^ a flattened Vec4f []
Which is better than it was:
$ ghc -O2 --make A.hs
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...
$ time ./A
516748.13
./A  3.58s user 0.01s system 99% cpu 3.593 total
multAdd compiles just fine:
        case readFloatOffAddr#
               rb_aVn
               (word2Int#
                  (and# (int2Word# sc1_s1Yx) __word 3))
               realWorld#
        of _ { (# s25_X1Tb, x4_X1Te #) ->
        case readFloatOffAddr#
               rb11_X118
               (word2Int#
                  (and# (int2Word# sc1_s1Yx) __word 3))
               realWorld#
        of _ { (# s26_X1WO, x5_X20B #) ->
        case writeFloatOffAddr#
               @ RealWorld
               a17_s1Oe
               sc3_s1Yz
               (plusFloat#
                  (timesFloat# x3_X1Qz x4_X1Te) x5_X20B)
However, you're doing 4-element at a time multiplies in the C code, so we'll need to do that directly, rather than faking it by looping and masking. GCC is probably unrolling the loop, too.
So to get identical performance, we'd need the vector multiply (a bit hard, possibly via the LLVM backend) and unroll the loop (possibly fusing it). I'll defer to Roman here to see if there's other obvious things.
One idea might be to actually use a Vector Vec4, rather than flattening it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With