I've got this Haskell file, compiled with `ghc -O2` (ghc 7.4.1), and it takes 1.65 sec on my machine:
```haskell
import Data.Bits

main = do
    print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
```
The same algorithm in C, compiled with `gcc -O2` (gcc 4.6.3), runs in 0.18 sec:
```c
#include <stdio.h>

void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
}
```
**Update:** I thought it might be the `Data.Bits` stuff going slow, but surprisingly, if I remove the shifting and just do a straight `mod`, it actually runs slower, at 5.6 seconds!?!
```haskell
import Data.Bits

main = do
    print $ length $ filter (\i -> (i `mod` 4) /= 0) [0..123456789]
```
whereas the equivalent C runs slightly faster at 0.16 sec:
```c
#include <stdio.h>

void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i % 4) != 0)
            ++count;
    printf("count: %d\n", count);
}
```
The two pieces of code do very different things.
```haskell
import Data.Bits

main = do
    print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
```
creates a list of 123456790 `Integer`s (lazily), takes the remainder modulo 4 of each (involving first a check whether the `Integer` is small enough to wrap a raw machine integer, then, after the division, a sign check, since `mod` returns non-negative results only - though in ghc-7.6.1 there is a primop for that, so using `mod` is not as much of a brake as it used to be), shifts the `Integer` 1 left by the appropriate number of bits, which involves a conversion to "big" `Integer`s and a call to GMP, takes the bitwise and with `i` - yet another call to GMP - and checks whether the result is 0, which causes another call to GMP or a conversion to a small integer (I'm not sure what GHC does here). Then, if the result is nonzero, a new list cell is created to hold that `Integer`, and it is consumed by `length`. That's a lot of work, most of it unnecessarily complicated, due to the defaulting of unspecified number types to `Integer`.
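For comparison, simply pinning the numbers to `Int` removes most of that overhead; a minimal sketch (my addition, not from the question):

```haskell
import Data.Bits

-- One annotation on the range stops the defaulting to Integer,
-- so all the arithmetic below runs on raw machine integers.
main :: IO ()
main = print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0)
             [0 .. 123456789 :: Int]
```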
The C code
```c
#include <stdio.h>

int main(void) {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
    return 0;
}
```
(I took the liberty of fixing the return type of `main`) does much, much less. It takes an `int`, compares it to another; if smaller, it takes the bitwise and of the first `int` with 3 (1), shifts the `int` 1 to the left by the appropriate number of bits, takes the bitwise and of that and the first `int`, and, if that is nonzero, increments another `int`, then increments the first. Those are all machine ops, working on raw machine types.
If we translate that code to Haskell,
```haskell
module Main (main) where

import Data.Bits

maxNum :: Int
maxNum = 123456789

loop :: Int -> Int -> Int
loop acc i
    | i < maxNum = loop (if i .&. (1 `shiftL` (i .&. 3)) /= 0 then acc + 1 else acc) (i+1)
    | otherwise  = acc

main :: IO ()
main = print $ loop 0 0
```
we get a much closer result:
```
C, gcc -O3:
count: 30864196

real    0m0.180s
user    0m0.178s
sys     0m0.001s

Haskell, ghc -O2:
30864196

real    0m0.247s
user    0m0.243s
sys     0m0.003s

Haskell, ghc -O2 -fllvm:
30864196

real    0m0.144s
user    0m0.140s
sys     0m0.003s
```
GHC's native code generator isn't a particularly good loop optimiser, so using the llvm backend makes a big difference here, but even the native code generator doesn't do too badly.
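(To reproduce that comparison, compile the same file once with `ghc -O2` and once with `ghc -O2 -fllvm -fforce-recomp`, then `time` both binaries; `-fforce-recomp` just makes GHC rebuild even though the source hasn't changed.)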
Okay, I did the optimisation of replacing a modulus by a power of two with a bitwise and by hand. GHC's native code generator doesn't do that (yet), so with `rem 4` instead of `.&. 3`, the native code generator produces code that takes (here) 1.42 seconds to run, but the llvm backend does perform that optimisation and produces the same code as the hand-optimised version.
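For concreteness, that slower native-codegen variant just swaps the mask for a remainder; a sketch of the loop as described (reconstructed, not copied from the original timing run):

```haskell
module Main (main) where

import Data.Bits

maxNum :: Int
maxNum = 123456789

-- Same loop as above, but with `rem` 4 instead of the bitwise mask;
-- the llvm backend turns this back into the masked version.
loop :: Int -> Int -> Int
loop acc i
    | i < maxNum = loop (if i .&. (1 `shiftL` (i `rem` 4)) /= 0 then acc + 1 else acc) (i + 1)
    | otherwise  = acc

main :: IO ()
main = print $ loop 0 0
```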
Now, let us turn to gspr's question:

> While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Well, the original code used lists, and llvm doesn't know too well what to do with those; it can't transform that code into loops. The modified code uses `Int`s, and the `vector` package rewrites the code to loops, so llvm does know how to optimise that well, and it shows.
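I don't have the exact modified code in front of me, but a `vector`-based version in that spirit might look like the sketch below; the pipeline shape is my guess, and the point is only that stream fusion leaves llvm a plain loop over unboxed `Int`s to optimise:

```haskell
import Data.Bits
import qualified Data.Vector.Unboxed as V

-- Counts the same thing as the loop above, but written against the
-- vector API; fusion compiles the pipeline into a single tight loop.
main :: IO ()
main = print . V.length
     . V.filter (\i -> i .&. (1 `shiftL` (i .&. 3)) /= 0)
     $ V.enumFromTo 0 (123456789 :: Int)
```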
(1) Assuming a normal binary computer. That optimisation is done by ordinary C compilers even without any optimisation flag, except on the very rare platforms where a `div` instruction is faster than a shift.
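If you want to convince yourself of the identity behind that footnote, here is a quick check (my illustration, not from the original answer):

```haskell
import Data.Bits

-- Sanity-check the identity footnote (1) relies on: for non-negative i,
-- i `mod` 4 equals i .&. 3, so a power-of-two remainder is just a mask.
main :: IO ()
main = print $ all (\i -> i `mod` 4 == i .&. 3) [0 .. 10000 :: Int]
```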