Although I have a good LSFR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which begs the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.
Here's the baseline Haskell code using lists and Data.Bits
:
import Control.Monad (when)
import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.))
import System.Environment (getArgs)
import System.Exit (exitFailure, exitSuccess)
tap :: [[Int]]
tap = [
[], [], [], [3, 2],
[4, 3], [5, 3], [6, 5], [7, 6],
[8, 6, 5, 4], [9, 5], [10, 7], [11, 9],
[12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
[16,15,13,4], [17, 14], [18, 11], [19, 6, 2, 1],
[20, 17], [21, 19], [22, 21], [23, 18],
[24,23,22,17], [25, 22], [26, 6, 2, 1], [27, 5, 2, 1],
[28, 25], [29, 27], [30, 6, 4, 1], [31, 28],
[32,22,2,1], [33,20], [34,27,2,1], [35,33],
[36,25], [37,5,4,3,2,1],[38,6,5,1], [39,35],
[40,38,21,19], [41,38], [42,41,20,19], [43,42,38,37],
[44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
[48,47,21,20], [49,40], [50,49,24,23], [51,50,36,35],
[52,49], [53,52,38,37], [54,53,18,17], [55,31],
[56,55,35,34], [57,50], [58,39], [59,58,38,37],
[60,59], [61,60,46,45], [62,61,6,5], [63,62] ]
xor' :: [Bool] -> Bool
xor' = foldr xor False
mask :: (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1
advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask len
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
main :: IO ()
main = do
args <- getArgs
when (null args) $ fail "Usage: lsfr <number-of-bits>"
let len = read $ head args
when (len < 8) $ fail "No need for LFSR"
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
if out == 0 then do
putStr "OK\n"
exitSuccess
else do
putStr "FAIL\n"
exitFailure
Basically it tests whether the LSFR defined in tap :: [[Int]]
for any given bit-length is of maximum-length. (More precisely, it just checks whether the LSFR reaches the initial state (zero) after 2n iterations.)
According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'
.
What I've tried so far:
Data.Array
: Attempt abandoned because there's no foldl/rData.Vector
: Slightly faster than the baselineThe compiler options I use are: -O2
, LTS Haskell 8.12 (GHC-8.0.2)
.
The reference C++ program can be found on gist.github.com.
The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much, there must be a better way to do the bit-fiddling.
Update: Results of applying the optimisations suggested in the answers
-O2 -fllvm
), the execution time goes down to 1.7s
-O2 -fllvm -optlc -mcpu=native
brings this to 0.73siterate
with iterate'
of @cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
Note
The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew
and then symlinking opt
and llc
. See 7.10. GHC Backends.
Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Lets dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. First was the use of shift
in high performance code. Shift supports both left and right shift and to do so it requires a branch. Lets kill that with more readable powers of two and such (shift 1 x
~> 2^x
and shift x 1
~> 2*x
):
% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well you have this awkward list of bit locations you're tapping that just seems like its begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21
) but we want a more general fast solution.
I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int
passed in to advance
instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native
or another instruction set selection that is non-pessimistic.
This step gives us pc
below. I've folded in the guard-removal of advance
that was mentioned in the comments:
let tp = sum $ map ((2^) . subtract 1) (tap !! len)
pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
mask = 2^len - 1
advance' :: Int -> Int
advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr
out :: Int
out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
OK
./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (g++ -O3
) and 0.12 (clang++ -O3 -march=native
) seconds, so it seems we've beat our mark by a fair bit.
I suspect that the following line is building a large list-like thunk in memory before evaluating it.
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0` is
Let's find out if I'm right, and if I am, we'll fix it. The first debugging step is to get an idea of the memory used by the program. To do this we're going to compile with the options -rtsopts
in addition to -O2
. This enables running the program with RTS options, inclusing +RTS -s
which outputs a small memory summary.
Running your program as lfsr 25 +RTS -s
I get the following output
OK
5,420,148,768 bytes allocated in the heap
6,705,977,216 bytes copied during GC
1,567,511,384 bytes maximum residency (20 sample(s))
357,862,432 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.453s 2.522s 0.0002s 0.0009s
Gen 1 20 colls, 0 par 2.281s 3.065s 0.1533s 0.7128s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.438s ( 1.162s elapsed)
GC time 4.734s ( 5.587s elapsed)
EXIT time 0.016s ( 0.218s elapsed)
Total time 6.188s ( 6.967s elapsed)
%GC time 76.5% (80.2% elapsed)
Alloc rate 3,770,538,273 bytes per MUT second
Productivity 23.5% of total user, 19.8% of total elapsed
That's a lot of memory used at once. It's very likely there's a big thunk building up in there somewhere.
I hypothesized that the thunk is being built in iterate (advance ...)
. If this is the case, we can try to reduce the thunk size by making advance
more strict in its lsfr
argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine's evaluated.
BangPatterns
is an easy way to make a function strict in an argument. f !x = ..
is shorthand for f x = seq x $ ...
{-# LANGUAGE BangPatterns #-}
advance :: Int -> [Int] -> Int -> Int
advance len tap = go
where
go !lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask len
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
Let's see what difference this makes ...
>lfsr 25 +RTS -s
OK
5,420,149,072 bytes allocated in the heap
6,705,979,368 bytes copied during GC
1,567,511,448 bytes maximum residency (20 sample(s))
357,862,448 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.688s 2.711s 0.0003s 0.0059s
Gen 1 20 colls, 0 par 2.438s 3.252s 0.1626s 0.8013s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.328s ( 1.146s elapsed)
GC time 5.125s ( 5.963s elapsed)
EXIT time 0.000s ( 0.226s elapsed)
Total time 6.484s ( 7.335s elapsed)
%GC time 79.0% (81.3% elapsed)
Alloc rate 4,081,053,418 bytes per MUT second
Productivity 21.0% of total user, 18.7% of total elapsed
None that's noticeable.
I guess it's the spine of that iterate (advance ...)
that's being built. After all, for the command I'm running the list would be 2^25
, or a little over 33 million items long. The list itself is probably being removed by list fusion, but the thunk for the last item of the list is over 33 million applications of advance ...
To solve this problem we need a strict version of iterate
so that the value is forced to an Int
before applying the advance
function again. This should keep the memory down to only a single lfsr
value at a time, along with the currently computed application of advance
.
Unfortunately, there isn't a strict iterate
in Data.List
. Here's one that doesn't give up on the list fusion that's providing other important (I think) performance optimizations to this problem.
{-# LANGUAGE BangPatterns #-}
import GHC.Base (build)
{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
where go !x = x : go (f x)
{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
where go !x = x `c` go (f x)
{-# RULES
"iterate'" [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'" [1] iterateFB' (:) = iterate'
#-}
This is just iterate
from GHC.List
(along with all its rewrite rules), but made strict in the accumulated argument.
Equipped with a strict iterate, iterate'
, we can change the troublesome line to
let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0
I expect that this will perform much better. Let's see ...
>lfsr 25 +RTS -s
OK
3,758,156,184 bytes allocated in the heap
297,976 bytes copied during GC
43,800 bytes maximum residency (1 sample(s))
21,736 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7281 colls, 0 par 0.047s 0.008s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.750s ( 0.783s elapsed)
GC time 0.047s ( 0.008s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.797s ( 0.792s elapsed)
%GC time 5.9% (1.0% elapsed)
Alloc rate 5,010,874,912 bytes per MUT second
Productivity 94.1% of total user, 99.0% of total elapsed
This used 0.00002
times as much memory and ran 10 times as fast.
I don't know if this will improve on Thomas DeBuisson's answer that improves advance
but still leaves a lazy iterate advance'
in place. It would be easy to check; add the iterate'
code to that answer and use iterate'
in place of iterate
in that answer.
Does the compiler lift tap !! len
out of the loop? I suspect it does, but moving it out to guarantee this can't hurt:
let tap1 = tap !! len
let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0
In the comments you say "2^len
is needed exactly once", but this is wrong. You do it each time in advance
. So you could try
advance len tap mask lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
-- in main
let tap1 = tap !! len
let numIterations = 2^len
let mask = numIterations - 1
let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)
(the compiler can't optimize last $ take ...
to !!
in general, because they are different for finite lists, but iterate
always returns an infinite one.)
You compared foldr
with foldl
, but foldl
is almost never what you need; since xor
always needs both arguments and is associative, foldl'
is very likely to be the right choice (the compiler can optimize it, but if there is any real difference between foldl
and foldr
and not just random variation, it might have failed in this case).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With