I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead
and unsafeWrite
functions from Data.Vector.Unboxed.Mutable
. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?
Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.
Kahan summation in Haskell:
{-# LANGUAGE BangPatterns #-}
module Test where
import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )
prng :: Word -> Word
prng w = w' where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
foldV :: (Unbox a, Unbox b)
=> (a -> b -> a) -- componentwise function to fold
-> Vector a -- initial accumulator value
-> [Vector b] -- data vectors
-> Vector a -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
liftV f acc = fV where
fV v = go 0 where
n = min (U.length v) (UM.length acc)
go i | i < n = step >> go (i + 1)
| otherwise = return ()
where
step = {-# SCC "fV_step" #-} do
a <- {-# SCC "fV_read" #-} UM.unsafeRead acc i
b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
{-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b
kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
acc = U.map (\z -> (z, 0.0)) v
kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
!y = x - c
!s' = s + y
!c' = (s' - s) - y
{-# NOINLINE kahanStep #-}
zero :: U.Vector Double
zero = U.replicate 100 0.0
myLoop n = kahan $ map mkVect [1..n]
main = print $ myLoop 100000
Compiling with ghc-7.0.3 using the llvm backend:
ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
time ./Test_hs
real 0m1.948s
user 0m1.936s
sys 0m0.008s
Profiling information:
16,710,594,992 bytes allocated in the heap
33,047,064 bytes copied during GC
35,464 bytes maximum residency (1 sample(s))
23,888 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 31907 collections, 0 parallel, 0.28s, 0.27s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.73s ( 24.74s elapsed)
GC time 0.28s ( 0.27s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 25.01s ( 25.02s elapsed)
%GC time 1.1% (1.1% elapsed)
Alloc rate 675,607,179 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
Thu Feb 23 02:42 2012 Time and Allocation Profiling Report (Final)
Test_hs +RTS -s -p -RTS
total time = 24.60 secs (1230 ticks @ 20 ms)
total alloc = 8,608,188,392 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fV_write Test 31.1 26.0
fV_read Test 27.2 23.2
mkVect Test 12.3 27.2
fV_step Test 11.7 0.0
foldV Test 5.9 5.7
fV_index Test 5.2 9.3
kahanStep Test 3.3 6.5
prng Test 2.2 1.8
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main1 Test 339 1 0.0 0.0 0.0 0.0
main Test 346 1 0.0 0.0 0.0 0.0
CAF:main2 Test 338 1 0.0 0.0 100.0 100.0
main Test 347 0 0.0 0.0 100.0 100.0
myLoop Test 348 1 0.2 0.2 100.0 100.0
mkVect Test 350 400000 12.3 27.2 14.5 29.0
prng Test 351 9900000 2.2 1.8 2.2 1.8
kahan Test 349 102 0.0 0.0 85.4 70.7
foldV Test 359 1 5.9 5.7 85.4 70.7
fV_step Test 360 9999900 11.7 0.0 79.5 65.1
fV_write Test 367 19999800 31.1 26.0 35.4 32.5
fV_apply Test 368 9999900 1.0 0.0 4.3 6.5
kahanStep Test 369 9999900 3.3 6.5 3.3 6.5
fV_index Test 366 9999900 5.2 9.3 5.2 9.3
fV_read Test 361 9999900 27.2 23.2 27.2 23.2
CAF:lvl19_r3ei Test 337 1 0.0 0.0 0.0 0.0
kahan Test 358 0 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad3_r3eg Test 336 1 0.0 0.0 0.0 0.0
kahan Test 357 0 0.0 0.0 0.0 0.0
CAF:$dMVector2_r3ee Test 335 1 0.0 0.0 0.0 0.0
CAF:$dVector1_r3ec Test 334 1 0.0 0.0 0.0 0.0
CAF:poly_$dMonad_r3ea Test 333 1 0.0 0.0 0.0 0.0
CAF:$dMVector1_r3e2 Test 330 1 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad2_r3e0 Test 328 1 0.0 0.0 0.0 0.0
foldV Test 365 0 0.0 0.0 0.0 0.0
CAF:lvl11_r3dM Test 322 1 0.0 0.0 0.0 0.0
kahan Test 354 0 0.0 0.0 0.0 0.0
CAF:lvl10_r3dK Test 321 1 0.0 0.0 0.0 0.0
kahan Test 355 0 0.0 0.0 0.0 0.0
CAF:$dMVector_r3dI Test 320 1 0.0 0.0 0.0 0.0
kahan Test 356 0 0.0 0.0 0.0 0.0
CAF GHC.Float 297 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 256 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 214 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 211 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Generic 182 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed 174 2 0.0 0.0 0.0 0.0
The equivalent implementation in C:
#include <stdint.h>
#include <stdio.h>
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahanStep (double *s, double *c, double x) {
double y, t;
y = x - *c;
t = *s + y;
*c = (t - *s) - y;
*s = t;
}
void kahan(double s[], double c[]) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
kahanStep(&s[j], &c[j], w);
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled with llvm-gcc:
>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c
>time ./Test_c
real 0m0.096s
user 0m0.088s
sys 0m0.004s
Update 1: I un-inlined kahanStep
in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As
inefficient as kahanStep
might be, unsafeRead
and unsafeWrite
are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.
Also, I should say that since I am interacting with a library that uses Data.Vector.Unboxed
, so I am kinda married to it at this point, and parting with it would be very traumatic :-)
Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter intuitive profiling stats, so I can decide whether or not to file a bug report against vector
.
Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself, in Haskell you created a polymorphic higher order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep
to a separate function in C isn't the point, it will still be inlined by the compiler. Even if you put it into its own source file, compile separately and link without link-time optimisation, you have only addressed part of the difference.
I have made a C version that is closer to the Haskell version,
kahan.h:
typedef struct DPair_st {
double fst, snd;
} DPair;
DPair kahanStep(DPair pr, double x);
kahanStep.c:
#include "kahan.h"
DPair kahanStep (DPair pr, double x) {
double y, t;
y = x - pr.snd;
t = pr.fst + y;
pr.snd = (t - pr.fst) - y;
pr.fst = t;
return pr;
}
main.c:
#include <stdint.h>
#include <stdio.h>
#include "kahan.h"
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
DPair pr;
pr.fst = s[j];
pr.snd = c[j];
pr = fun(pr,w);
s[j] = pr.fst;
c[j] = pr.snd;
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err,kahanStep);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).
Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of double
s and an unboxed double
as arguments, while the Haskell kahanStep
takes a boxed pair of boxed Double
s and a boxed Double
and returns a boxed pair of boxed Double
s, requiring expensive boxing and unboxing in the foldV
loop. That is addressable by more inlining. Explicitly inlining foldV
, kahanStep
, and step
brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).
But the boxing and unboxing is, alas, the smaller part of the difference. foldV
does much more than C's kahan
, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word#
passed in a register, the precomputed value is read from the vector.
If you use a more direct translation of the C to Haskell,
{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
prng :: Word -> Word
prng w = w'
where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
!cj <- unsafeRead c j
!sj <- unsafeRead s j
let !y = fromIntegral w - cj
!t = sj + y
!w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backed of GHC, using llvm may give slightly different results).
I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector
or bytestring
which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.
In the Core, all reads and writes are transformed to the primops readDoubleArray#
, indexDoubleArray#
and writeDoubleArray#
, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-}
annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-}
pragmas on foldV
, step
and kahanStep
, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step
, fV_read
, fV_index
, fV_write
and fV_apply
removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P
, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-}
pragmas on mkVect
, kahan
and prng
, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)
So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.
When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).
Looking at the Core generated for your programme, what jumped out was that your kahanStep
- by the way, remove the {-# NOINLINE #-}
pragma from that, it's counter-productive - produced a boxed pair of boxed Double
s in the loop, that was immediately deconstructed and the components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.
As this came up on haskell-cafe again today where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word
to Double
. Replacing the fromIntegral
of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here, GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3
's output for the original C:
{-# LANGUAGE CPP #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
main :: IO ()
main = print . elems $ runSTUArray calc
There is a funny mixing in of list combinators in all of this seemingly Data.Vector
code. If I make the very first obvious emendation, replacing
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
with the correct use of Data.Vector.Unboxed
:
mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng
then my time falls by two thirds -- from real 0m1.306s
to real 0m0.429s
It looks like all of the top level functions have this problem except prng
and zero
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With