Optimizing Haskell Recursive Lists

Another Haskell optimization question, following up on my previous one. I need to generate a list recursively, in the style of the fibs function found in many introductory Haskell articles:

import Data.Bits (rotate, xor)
import Data.List (zipWith4)
import Data.Word (Word32)

generateSchedule :: [Word32] -> [Word32]
generateSchedule blkw = take 80 ws
    where
    ws          = blkw ++ zipWith4 gen (drop 13 ws) (drop 8 ws) (drop 2 ws) ws
    gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

This function has become the biggest consumer of both time and allocation in my program. The profiler gives me the following statistics:

COST CENTRE        MODULE             %time %alloc  ticks     bytes
generateSchedule   Test.Hash.SHA1     22.1   40.4   31        702556640

I thought of using unboxed vectors to calculate the list, but I cannot figure out a way to do it, since the list is recursive. This would have a natural implementation in C, but I do not see a way to make it faster in Haskell (other than unrolling it into 80 lines of variable declarations). Any help?

Update: I actually did unroll it quickly to see if it would help. The code is here. It is ugly, and in fact it was slower.

COST CENTRE        MODULE             %time %alloc  ticks     bytes
generateSchedule   GG.Hash.SHA1       22.7   27.6   40        394270592
Ana asked Nov 15 '11

2 Answers

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Data.Word

generateSchedule :: [Word32] -> UArray Int Word32
generateSchedule ws0 = runSTUArray $ do
    arr <- unsafeNewArray_ (0,79)
    let fromList i [] = fill i 0
        fromList i (w:ws) = do
            unsafeWrite arr i w
            fromList (i+1) ws
        fill i j
          | i == 80 = return arr
          | otherwise = do
              d <- unsafeRead arr j
              c <- unsafeRead arr (j+2)
              b <- unsafeRead arr (j+8)
              a <- unsafeRead arr (j+13)
              unsafeWrite arr i (gen a b c d)
              fill (i+1) (j+1)
    fromList 0 ws0

will create an unboxed array corresponding to your list. It relies on the assumption that the list argument contains at least 14 and at most 80 items, otherwise it'll misbehave badly. I think it'll always be 16 items (64 bytes), so that should be safe for you. (But it's probably better to start filling directly from the ByteString rather than to construct an intermediate list.)
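Filling directly from the ByteString could look something like this. This is only a sketch: word32At is a hypothetical helper (not part of the answer above) that assembles a big-endian Word32 from four bytes, as SHA-1's message schedule expects; fromList could then read word32At block (4*i) for i in [0..15] instead of consuming an intermediate list.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Hypothetical helper: read a big-endian Word32 starting at byte
-- offset i. No bounds check, mirroring the unsafe* style above.
word32At :: B.ByteString -> Int -> Word32
word32At bs i =
      (fromIntegral (BU.unsafeIndex bs i)       `shiftL` 24)
  .|. (fromIntegral (BU.unsafeIndex bs (i + 1)) `shiftL` 16)
  .|. (fromIntegral (BU.unsafeIndex bs (i + 2)) `shiftL` 8)
  .|.  fromIntegral (BU.unsafeIndex bs (i + 3))
```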

By strictly evaluating this before doing the hashing rounds, you save the switching between list construction and hashing that you have with the lazily constructed list, which should reduce the time needed. By using an unboxed array, we avoid the allocation overhead of lists, which may further improve speed (but GHC's allocator is very fast, so don't expect too much impact from that).

In your hashing rounds, get the needed Word32 via unsafeAt array t to avoid unnecessary bounds-checking.

Addendum: Unrolling the creation of the list might be faster if you put a bang on each wn, though I'm not sure. Since you already have the code, adding bangs and checking isn't too much work, is it? I'm curious.
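To make that suggestion concrete, a banged unrolled version would start like this. This is only an illustrative fragment showing the first two of the generated bindings; firstTwo is a made-up name, and the offsets follow the zipWith4 definition for a 16-word block.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits (rotate, xor)
import Data.Word (Word32)

gen :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

-- Each !wN is forced as soon as it is bound, so no thunk chain
-- builds up between the bindings.
firstTwo :: [Word32] -> (Word32, Word32)
firstTwo (w0:w1:w2:w3:_:_:_:_:w8:w9:_:_:_:w13:w14:_) =
    let !w16 = gen w13 w8 w2 w0
        !w17 = gen w14 w9 w3 w1
    in (w16, w17)
firstTwo _ = error "need at least 15 words"
```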

Daniel Fischer answered Oct 16 '22

We can use lazy arrays to get a halfway house between going straight mutable and using pure lists. You get the benefits of a recursive definition, but for that reason still pay the price of laziness and boxing -- though less so than with lists. The following code uses criterion to test two lazy array solutions (using standard arrays, and vectors) as well as the original list code and Daniel's mutable uarray code above:

module Main where
import Data.Bits
import Data.List
import Data.Word
import qualified Data.Vector as LV
import Data.Array.ST
import Data.Array.Unboxed
import qualified Data.Array as A
import Data.Array.Base
import Criterion.Main

gen :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

gss blkw = LV.toList v
    where v = LV.fromList $ blkw ++ rest
          rest = map (\i -> gen (LV.unsafeIndex v (i + 13))
                                (LV.unsafeIndex v (i + 8))
                                (LV.unsafeIndex v (i + 2))
                                (LV.unsafeIndex v i)
                     )
                 [0..79 - 14]

gss' blkw = A.elems v
    where v = A.listArray (0,79) $ blkw ++ rest
          rest = map (\i -> gen (unsafeAt v (i + 13))
                                (unsafeAt v (i + 8))
                                (unsafeAt v (i + 2))
                                (unsafeAt v i)
                     )
                 [0..79 - 14]

generateSchedule :: [Word32] -> [Word32]
generateSchedule blkw = take 80 ws
    where
    ws          = blkw ++ zipWith4 gen (drop 13 ws) (drop 8 ws) (drop 2 ws) ws

gs :: [Word32] -> [Word32]
gs ws = elems (generateSched ws)

generateSched :: [Word32] -> UArray Int Word32
generateSched ws0 = runSTUArray $ do
    arr <- unsafeNewArray_ (0,79)
    let fromList i [] = fill i 0
        fromList i (w:ws) = do
            unsafeWrite arr i w
            fromList (i+1) ws
        fill i j
          | i == 80 = return arr
          | otherwise = do
              d <- unsafeRead arr j
              c <- unsafeRead arr (j+2)
              b <- unsafeRead arr (j+8)
              a <- unsafeRead arr (j+13)
              unsafeWrite arr i (gen a b c d)
              fill (i+1) (j+1)
    fromList 0 ws0

args = [0..13]

main = defaultMain [
        bench "list"   $ whnf (sum . generateSchedule) args
       ,bench "vector" $ whnf (sum . gss) args
       ,bench "array"  $ whnf (sum . gss') args
       ,bench "uarray" $ whnf (sum . gs) args
       ]

I compiled the code with -O2 and -funfolding-use-threshold=256 to force lots of inlining.

The criterion benchmarks demonstrate that the vector solution is slightly better, and the array solution slightly better still, but that the unboxed mutable solution still wins by a landslide:

benchmarking list
mean: 8.021718 us, lb 7.720636 us, ub 8.605683 us, ci 0.950
std dev: 2.083916 us, lb 1.237193 us, ub 3.309458 us, ci 0.950

benchmarking vector
mean: 6.829923 us, lb 6.725189 us, ub 7.226799 us, ci 0.950
std dev: 882.3681 ns, lb 76.20755 ns, ub 2.026598 us, ci 0.950

benchmarking array
mean: 6.212669 us, lb 5.995038 us, ub 6.635405 us, ci 0.950
std dev: 1.518521 us, lb 946.8826 ns, ub 2.409086 us, ci 0.950

benchmarking uarray
mean: 2.380519 us, lb 2.147896 us, ub 2.715305 us, ci 0.950
std dev: 1.411092 us, lb 1.083180 us, ub 1.862854 us, ci 0.950

I ran some basic profiling too, and noticed that the lazy/boxed array solutions did slightly better than the list solution, but again significantly worse than the unboxed array approach.
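One more middle ground, not benchmarked above but aimed squarely at the "recursive list into an unboxed vector" part of the question: Data.Vector.Unboxed.constructN builds a vector by handing each step the prefix constructed so far, so the self-referential definition goes through without explicit mutation. A sketch, assuming at least 14 input words as in Daniel's answer:

```haskell
import Data.Bits (rotate, xor)
import Data.Word (Word32)
import qualified Data.Vector.Unboxed as U

gen :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

-- constructN 80 step calls step on the partial vector built so far;
-- U.length v is the index currently being filled.
generateScheduleV :: [Word32] -> U.Vector Word32
generateScheduleV blkw = U.constructN 80 step
  where
    ws0 = U.fromList blkw
    n   = U.length ws0
    step v
      | i < n     = ws0 U.! i              -- copy the input words
      | otherwise = gen (v U.! (k + 13))   -- same offsets as the
                        (v U.! (k + 8))    -- zipWith4 definition
                        (v U.! (k + 2))
                        (v U.! k)
      where
        i = U.length v
        k = i - n    -- position within the generated tail
```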

sclv answered Oct 16 '22