Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing Haskell Inner Loops

Still working on my SHA1 implementation in Haskell. I've now got a working implementation and this is the inner loop:

iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e    = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
    where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d

The profiler tells me that this function takes 1/3 of the runtime of my implementation. I can think of no way to further optimize it other than maybe inlining the temp variables but I believe -O2 will do that for me anyway.

Can anyone see a significant optimization that can be further applied?

FYI the k and f calls are below. They are so simple I don't think there is a way to optimize these other. Unless the Data.Bits module is slow?

f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
    | t <= 19   = (b .&. c) .|. ((complement b) .&. d)
    | t <= 39   = b `xor` c `xor` d
    | t <= 59   = (b .&. c) .|. (b .&. d) .|. (c .&. d)
    | otherwise = b `xor` c `xor` d

k :: Int -> Word32
k t
    | t <= 19   = 0x5A827999
    | t <= 39   = 0x6ED9EBA1
    | t <= 59   = 0x8F1BBCDC
    | otherwise = 0xCA62C1D6
like image 363
Ana Avatar asked Nov 14 '11 21:11

Ana


1 Answers

Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation). You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.

Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4, and four functions f1, ..., f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.

And, as Thomas said, using a list for the block data isn't optimal, an unboxed Word array/vector would probably help too.

With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.

                      (GHC.Prim.narrow32Word#
                         (GHC.Prim.plusWord#
                            (GHC.Prim.narrow32Word#
                               (GHC.Prim.plusWord#
                                  (GHC.Prim.narrow32Word#
                                     (GHC.Prim.plusWord#
                                        (GHC.Prim.narrow32Word#
                                           (GHC.Prim.plusWord#
                                              (GHC.Prim.narrow32Word#
                                                 (GHC.Prim.or#
                                                    (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                                                    (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                                              y#_aBw))
                                        sc6_sEr))
                                  y#1_XCZ))
                            y#2_XD6))

See all these narrow32Word#? They're cheap, but not free. Only the outermost is needed, there may be a bit to harvest by hand-coding the steps and using Word.

Then the comparisons of t with 19, ..., they appear twice, once to determine the k constant, and once for the f transform. The comparisons alone are cheap, but they cause branches and without them, further inlining may be possible. I expect a bit could be gained here too.

And still, the list. That means w can't be unboxed, the core could be simpler if w were unboxable.

like image 131
Daniel Fischer Avatar answered Sep 21 '22 14:09

Daniel Fischer