Doing efficient Numerics in Haskell

Question

I was inspired by this post called "Only fast languages are interesting" to look at the problem he suggests (sum'ing a couple of million numbers from a vector) in Haskell and compare to his results.

I'm a Haskell newbie so I don't really know how to time correctly or how to do this efficiently, my first attempt at this problem was the following. Note that I'm not using random numbers in the vector as I'm not sure how to do in a good way. I'm also printing stuff in order to ensure full evaluation.

import System.TimeIt

import Data.Vector as V

vector :: IO (Vector Int)
vector = do
  let vec = V.replicate 3000000 10
  print $ V.length vec
  return vec

sumit :: IO ()
sumit = do
  vec <- vector
  print $ V.sum vec

time = timeIt sumit

Loading this up in GHCI and running time tells me that it took about 0.22s to run for 3 million numbers and 2.69s for 30 million numbers.

Compared to the blog authors results of 0.02s and 0.18s in Lush it's quite a lot worse, which leads me to believe this can be done in a better way.

Note: The above code needs the package TimeIt to run. cabal install timeit will get it for you.

hammar · Accepted Answer

First of all, realize that GHCi is an interpreter, and it's not designed to be very fast. To get more useful results you should compile the code with optimizations enabled. This can make a huge difference.

Also, for any serious benchmarking of Haskell code, I recommend using criterion. It uses various statistical techniques to ensure that you're getting reliable measurements.

I modified your code to use criterion and removed the print statements so that we're not timing the I/O.

import Criterion.Main
import Data.Vector as V

vector :: IO (Vector Int)
vector = do
  let vec = V.replicate 3000000 10
  return vec

sumit :: IO Int
sumit = do
  vec <- vector
  return $ V.sum vec

main = defaultMain [bench "sumit" $ whnfIO sumit]

Compiling this with -O2, I get this result on a pretty slow netbook:

$ ghc --make -O2 Sum.hs
$ ./Sum 
warming up
estimating clock resolution...
mean is 56.55146 us (10001 iterations)
found 1136 outliers among 9999 samples (11.4%)
  235 (2.4%) high mild
  901 (9.0%) high severe
estimating cost of a clock call...
mean is 2.493841 us (38 iterations)
found 4 outliers among 38 samples (10.5%)
  2 (5.3%) high mild
  2 (5.3%) high severe

benchmarking sumit
collecting 100 samples, 8 iterations each, in estimated 6.180620 s
mean: 9.329556 ms, lb 9.222860 ms, ub 9.473564 ms, ci 0.950
std dev: 628.0294 us, lb 439.1394 us, ub 1.045119 ms, ci 0.950

So I'm getting an average of just over 9 ms with a standard deviation of less than a millisecond. For the larger test case, I'm getting about 100ms.

Enabling optimizations is especially important when using the vector package, as it makes heavy use of stream fusion, which in this case is able to eliminate the data structure entirely, turning your program into an efficient, tight loop.

It may also be worthwhile to experiment with the new LLVM-based code generator by using -fllvm option. It is apparently well-suited for numeric code.

applicative · Answer

Your original file, uncompiled, then compiled without optimization, then compiled with a simple optimization flag:

$ runhaskell boxed.hs  
3000000
30000000
CPU time:   0.35s

$ ghc --make boxed.hs -o unoptimized 
$ ./unoptimized
3000000
30000000
CPU time:   0.34s



$ ghc --make -O2 boxed.hs 
$ ./boxed
3000000
30000000
CPU time:   0.09s

Your file with import qualified Data.Vector.Unboxed as V instead of import qualified Data.Vector as V (Int is an unboxable type) -- first without optimization then with:

$ ghc --make unboxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time:   0.27s


$ ghc --make -O2 unboxed.hs 
$ ./unboxed
3000000
30000000
CPU time:   0.04s

So, compile, optimize ... and where possible use Data.Vector.Unboxed

ertes · Answer

Try to use an unboxed vector, although I'm not sure whether it makes a noticable difference in this case. Note also that the comparison is slightly unfair, because the vector package should optimize the vector away entirely (this optimization is called stream fusion).

lbolla · Answer

If you use big enough vectors, Vector Unboxed might become impractical. For me pure (lazy) lists are quicker, if vector size > 50000000:

import System.TimeIt

sumit :: IO ()
sumit = print . sum $ replicate 50000000 10

main :: IO ()
main = timeIt sumit

I get these times:

Unboxed Vectors
CPU time:   1.00s

List:
CPU time:   0.70s

Edit: I've repeated the benchmark using Criterion and making sumit pure. Code and results as follow:

Code:

import Criterion.Main

sumit :: Int -> Int
sumit m = sum $ replicate m 10

main :: IO ()
main = defaultMain [bench "sumit" $ nf sumit 50000000]

Results:

warming up
estimating clock resolution...
mean is 7.248078 us (80001 iterations)
found 24509 outliers among 79999 samples (30.6%)
  6044 (7.6%) low severe
  18465 (23.1%) high severe
estimating cost of a clock call...
mean is 68.15917 ns (65 iterations)
found 7 outliers among 65 samples (10.8%)
  3 (4.6%) high mild
  4 (6.2%) high severe

benchmarking sumit
collecting 100 samples, 1 iterations each, in estimated 46.07401 s
mean: 451.0233 ms, lb 450.6641 ms, ub 451.5295 ms, ci 0.950
std dev: 2.172022 ms, lb 1.674497 ms, ub 2.841110 ms, ci 0.950

It looks like print makes a big difference, as it should be expected!

Doing efficient Numerics in Haskell

Tags:

optimization

haskell

numerics

Fredrik

4 Answers

hammar

applicative

ertes

lbolla

Recent Activity

Donate For Us

Doing efficient Numerics in Haskell

Tags:

optimization

haskell

numerics

Fredrik

4 Answers

hammar

applicative

ertes

lbolla

Related questions

Recent Activity

Donate For Us