In this trivial program to print all the numbers from 1 to 10000000, a Haskell version and a C version, why is the Haskell one so slow and what commands help to learn how to improve the Haskell program's performance?
Below is a report containing all the details necessary to reproduce my exciting event, the sources are printed when making the report including the source of the Makefile:
$ make -B report
cat Foo.hs
import Data.Foldable
main = traverse_ print [1..10000000]
cat Fooc.c
#include <stdio.h>
int main()
{
for (int n = 0; n < 10000000; ++n)
{
printf("%d\n", n+1);
}
}
ghc -O3 Foo.hs -o Foo
time ./Foo | tail -n1
3.45user 0.03system 0:03.49elapsed 99%CPU (0avgtext+0avgdata 4092maxresident)k
0inputs+0outputs (0major+290minor)pagefaults 0swaps
10000000
cc -O3 Fooc.c -o Fooc
time ./Fooc | tail -n1
0.63user 0.02system 0:00.66elapsed 99%CPU (0avgtext+0avgdata 1468maxresident)k
0inputs+0outputs (0major+63minor)pagefaults 0swaps
10000000
cat Makefile
.PHONY: printFoo printFooc printMakefile
printFoo: Foo.hs
cat $^
printFooc: Fooc.c
cat $^
printMakefile: Makefile
cat $^
Fooc: CFLAGS=-O3
Fooc: Fooc.c
Foo: Foo.hs
ghc -O3 $^ -o $@
.PHONY: timeFoo timeFooc
timeFoo: Foo
time ./$^ | tail -n1
timeFooc: Fooc
time ./$^ | tail -n1
.PHONY: report
report: printFoo printFooc timeFoo timeFooc printMakefile
On my system your Haskell code takes about 3.2 seconds.
N.B. your C code takes...
time ./fooc | tail -n1
ld: warning: directory not found for option '-L/opt/local/lib'
10000000
./fooc 0.92s user 0.03s system 33% cpu 2.863 total
tail -n1 2.85s user 0.01s system 99% cpu 2.865 total
So do note the difference of time a | b and what that means vs time (a | b).
Haskell is slow in part because (some of this is hypothesis)
print and the underlying putStrLn use String which is a linked list of characters.For 1, the packed variant using Text doesn't perform much differently, perhaps due to issue 2.
For 2, the ByteString variant (packed bytes instead of characters) is more representative of what your C program is doing:
-- Using functions from the Relude package
main = traverse_ putBSLn (show <$> [(1::Int)..10000000])
Results in
10000000
./foo 1.55s user 0.08s system 56% cpu 2.904 total
So the CPU time is closer to your C program, leading me to hypothesize that this difference is largely about the unnecessary UTF8 handling built into the routines you use by default in Haskell's prelude.
Dead-ends:
NoBuffering and large BlockBuffering with no luck.Text instead of String gave only the barest improvement.showed into Strings. This could be a win, I expect, if done well.EDIT: I can't believe I forgot about Builder, which is an optimized way to build bytestrings and, in some cases, it fuses well to reduce allocations. Builder is the underpinning of the above example I already showed but using it directly allows some manual optimization.
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder
import System.IO (stdout)
import Data.Foldable
main :: IO ()
main = do
traverse_ (hPutBuilder stdout . (<>"\n") . intDec) [(1::Int)..10000000]
Performing at:
./foo 1.05s user 0.13s system 38% cpu 3.048 total
tail -n1 3.02s user 0.01s system 99% cpu 3.047 total
And indeed this is more efficient than the prior, many separate, calls to hPut because as hPutBuilder says:
This function is more efficient than hPut . toLazyByteString because in many cases no buffer allocation has to be done. Moreover, the results of several executions of short Builders are concatenated in the Handles buffer, therefore avoiding unnecessary buffer flushes.
So I should add: 4. Haskell was slow in this case because sometimes computations do not fuse and you end up with excess allocation, which isn't free.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With