Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

Question

I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead and unsafeWrite functions from Data.Vector.Unboxed.Mutable. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?

Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.

Kahan summation in Haskell:

{-# LANGUAGE BangPatterns #-}
module Test where

import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )

prng :: Word -> Word
prng w = w' where
    !w1 = w  `xor` (w  `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

foldV :: (Unbox a, Unbox b) 
      => (a -> b -> a) -- componentwise function to fold
      -> Vector a      -- initial accumulator value
      -> [Vector b]    -- data vectors
      -> Vector a      -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
    liftV f acc = fV where
        fV v = go 0 where
            n = min (U.length v) (UM.length acc)
            go i | i < n     = step >> go (i + 1)
                 | otherwise = return ()
                 where
                     step = {-# SCC "fV_step" #-} do
                         a <- {-# SCC "fV_read"  #-} UM.unsafeRead acc i
                         b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
                         {-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b

kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
    acc = U.map (\z -> (z, 0.0)) v

kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
    !y  = x - c
    !s' = s + y
    !c' = (s' - s) - y
{-# NOINLINE kahanStep #-}

zero :: U.Vector Double
zero = U.replicate 100 0.0

myLoop n = kahan $ map mkVect [1..n]

main = print $ myLoop 100000

Compiling with ghc-7.0.3 using the llvm backend:

ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs

time ./Test_hs
real    0m1.948s
user    0m1.936s
sys     0m0.008s

Profiling information:

16,710,594,992 bytes allocated in the heap
      33,047,064 bytes copied during GC
          35,464 bytes maximum residency (1 sample(s))
          23,888 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 31907 collections,     0 parallel,  0.28s,  0.27s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   24.73s  ( 24.74s elapsed)
  GC    time    0.28s  (  0.27s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   25.01s  ( 25.02s elapsed)

  %GC time       1.1%  (1.1% elapsed)

  Alloc rate    675,607,179 bytes per MUT second

  Productivity  98.9% of total user, 98.9% of total elapsed

    Thu Feb 23 02:42 2012 Time and Allocation Profiling Report  (Final)

       Test_hs +RTS -s -p -RTS

    total time  =       24.60 secs   (1230 ticks @ 20 ms)
    total alloc = 8,608,188,392 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

fV_write                       Test                  31.1   26.0
fV_read                        Test                  27.2   23.2
mkVect                         Test                  12.3   27.2
fV_step                        Test                  11.7    0.0
foldV                          Test                   5.9    5.7
fV_index                       Test                   5.2    9.3
kahanStep                      Test                   3.3    6.5
prng                           Test                   2.2    1.8


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
 CAF:main1               Test                                                 339           1   0.0    0.0     0.0    0.0
  main                   Test                                                 346           1   0.0    0.0     0.0    0.0
 CAF:main2               Test                                                 338           1   0.0    0.0   100.0  100.0
  main                   Test                                                 347           0   0.0    0.0   100.0  100.0
   myLoop                Test                                                 348           1   0.2    0.2   100.0  100.0
    mkVect               Test                                                 350      400000  12.3   27.2    14.5   29.0
     prng                Test                                                 351     9900000   2.2    1.8     2.2    1.8
    kahan                Test                                                 349         102   0.0    0.0    85.4   70.7
     foldV               Test                                                 359           1   5.9    5.7    85.4   70.7
      fV_step            Test                                                 360     9999900  11.7    0.0    79.5   65.1
       fV_write          Test                                                 367    19999800  31.1   26.0    35.4   32.5
        fV_apply         Test                                                 368     9999900   1.0    0.0     4.3    6.5
         kahanStep       Test                                                 369     9999900   3.3    6.5     3.3    6.5
       fV_index          Test                                                 366     9999900   5.2    9.3     5.2    9.3
       fV_read           Test                                                 361     9999900  27.2   23.2    27.2   23.2
 CAF:lvl19_r3ei          Test                                                 337           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 358           0   0.0    0.0     0.0    0.0
 CAF:poly_$dPrimMonad3_r3eg Test                                                 336           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 357           0   0.0    0.0     0.0    0.0
 CAF:$dMVector2_r3ee     Test                                                 335           1   0.0    0.0     0.0    0.0
 CAF:$dVector1_r3ec      Test                                                 334           1   0.0    0.0     0.0    0.0
 CAF:poly_$dMonad_r3ea   Test                                                 333           1   0.0    0.0     0.0    0.0
 CAF:$dMVector1_r3e2     Test                                                 330           1   0.0    0.0     0.0    0.0
 CAF:poly_$dPrimMonad2_r3e0 Test                                                 328           1   0.0    0.0     0.0    0.0
  foldV                  Test                                                 365           0   0.0    0.0     0.0    0.0
 CAF:lvl11_r3dM          Test                                                 322           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 354           0   0.0    0.0     0.0    0.0
 CAF:lvl10_r3dK          Test                                                 321           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 355           0   0.0    0.0     0.0    0.0
 CAF:$dMVector_r3dI      Test                                                 320           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 356           0   0.0    0.0     0.0    0.0
 CAF                     GHC.Float                                            297           1   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Handle.FD                                     256           2   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Encoding.Iconv                                214           2   0.0    0.0     0.0    0.0
 CAF                     GHC.Conc.Signal                                      211           1   0.0    0.0     0.0    0.0
 CAF                     Data.Vector.Generic                                  182           1   0.0    0.0     0.0    0.0
 CAF                     Data.Vector.Unboxed                                  174           2   0.0    0.0     0.0    0.0

memory residency by cost center memory residency by type for <code>foldV</code>

The equivalent implementation in C:

#include <stdint.h>
#include <stdio.h>


#define VDIM    100
#define VNUM    100000



uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahanStep (double *s, double *c, double x) {
    double y, t;
    y  = x - *c;
    t  = *s + y;
    *c = (t - *s) - y;
    *s = t;
}

void kahan(double s[], double c[]) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
                kahanStep(&s[j], &c[j], w);
                w = prng(w);
        }
    }
};


int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]
");
};

Compiled with llvm-gcc:

>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c

>time ./Test_c
real    0m0.096s
user    0m0.088s
sys     0m0.004s

Update 1: I un-inlined kahanStep in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As inefficient as kahanStep might be, unsafeRead and unsafeWrite are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.

Also, I should say that since I am interacting with a library that uses Data.Vector.Unboxed, so I am kinda married to it at this point, and parting with it would be very traumatic :-)

Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter intuitive profiling stats, so I can decide whether or not to file a bug report against vector.

Daniel Fischer · Accepted Answer

Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself, in Haskell you created a polymorphic higher order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep to a separate function in C isn't the point, it will still be inlined by the compiler. Even if you put it into its own source file, compile separately and link without link-time optimisation, you have only addressed part of the difference.

I have made a C version that is closer to the Haskell version,

kahan.h:

typedef struct DPair_st {
    double fst, snd;
    } DPair;

DPair kahanStep(DPair pr, double x);

kahanStep.c:

#include "kahan.h"

DPair kahanStep (DPair pr, double x) {
    double y, t;
    y  = x - pr.snd;
    t  = pr.fst + y;
    pr.snd = (t - pr.fst) - y;
    pr.fst = t;
    return pr;
}

main.c:

#include <stdint.h>
#include <stdio.h>
#include "kahan.h"


#define VDIM    100
#define VNUM    100000

uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
            DPair pr;
            pr.fst = s[j];
            pr.snd = c[j];
            pr = fun(pr,w);
            s[j] = pr.fst;
            c[j] = pr.snd;
            w = prng(w);
        }
    }
};


int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err,kahanStep);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]
");
};

Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).

Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of doubles and an unboxed double as arguments, while the Haskell kahanStep takes a boxed pair of boxed Doubles and a boxed Double and returns a boxed pair of boxed Doubles, requiring expensive boxing and unboxing in the foldV loop. That is addressable by more inlining. Explicitly inlining foldV, kahanStep, and step brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).

But the boxing and unboxing is, alas, the smaller part of the difference. foldV does much more than C's kahan, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word# passed in a register, the precomputed value is read from the vector.

If you use a more direct translation of the C to Haskell,

{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits

prng :: Word -> Word
prng w = w'
  where
    !w1 = w `xor` (w `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
            | j < VDIM  = do
                !cj <- unsafeRead c j
                !sj <- unsafeRead s j
                let !y = fromIntegral w - cj
                    !t = sj + y
                    !w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc

it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backed of GHC, using llvm may give slightly different results).

I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector or bytestring which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.

In the Core, all reads and writes are transformed to the primops readDoubleArray#, indexDoubleArray# and writeDoubleArray#, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-} annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-} pragmas on foldV, step and kahanStep, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step, fV_read, fV_index, fV_write and fV_apply removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-} pragmas on mkVect, kahan and prng, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)

So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.

When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).

Looking at the Core generated for your programme, what jumped out was that your kahanStep - by the way, remove the {-# NOINLINE #-} pragma from that, it's counter-productive - produced a boxed pair of boxed Doubles in the loop, that was immediately deconstructed and the components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.

As this came up on haskell-cafe again today where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word to Double. Replacing the fromIntegral of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here, GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3's output for the original C:

{-# LANGUAGE CPP #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)

prng :: Word -> Word
prng w = w'
  where
    w1 = w `xor` (w `shiftL` 13)
    w2 = w1 `xor` (w1 `shiftR` 7)
    w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
            | j < VDIM  = do
                cj <- unsafeRead c j
                sj <- unsafeRead s j
                let y = word2Double w - cj
                    t = sj + y
                    w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

correction :: Double
correction = 2 * int2Double minBound

word2Double :: Word -> Double
word2Double w = case fromIntegral w of
                  i | i < 0 -> int2Double i - correction
                    | otherwise -> int2Double i

main :: IO ()
main = print . elems $ runSTUArray calc

applicative · Answer

There is a funny mixing in of list combinators in all of this seemingly Data.Vector code. If I make the very first obvious emendation, replacing

mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

with the correct use of Data.Vector.Unboxed:

mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng

then my time falls by two thirds -- from real 0m1.306s to real 0m0.429s It looks like all of the top level functions have this problem except prng and zero

Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

Tags:

performance

floating-point

profiling

haskell

vector

Alinabi

2 Answers

Daniel Fischer

applicative

Recent Activity

Donate For Us

Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

Tags:

performance

floating-point

profiling

haskell

vector

Alinabi

2 Answers

Daniel Fischer

applicative

Related questions

Recent Activity

Donate For Us