 

Mutable, (possibly parallel) Haskell code and performance tuning

I have now implemented another SHA3 candidate, namely Grøstl. This is still very much work in progress, but at the moment a 224-bit version passes all KATs. So now I'm wondering about performance (again :->). The difference this time is that I chose to mirror the (optimized) C implementation more closely, i.e. I made a port from C to Haskell. The optimized C version uses table lookups to implement the algorithm. Furthermore, the code is heavily based on updating an array of 64-bit words. Thus I chose to use mutable unboxed vectors in Haskell.

My Grøstl code can be found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/GroestlMutable.hs

Short description of the algorithm: it's a Merkle-Damgård construction, iterating a compression function (f512M in my code) as long as there are 512-bit blocks of message left. The compression function is very simple: it just runs two independent 512-bit permutations P and Q (permP and permQ in my code) and combines their output. It's these permutations which are implemented by lookup tables.
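In Haskell terms, that iteration is just a left fold of the compression function over the message blocks. A minimal pure sketch of the structure (permP/permQ below are trivial placeholders, not the real table-driven permutations from the repo):

```haskell
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V

type State = V.Vector Word64   -- the 512-bit state as 8 64-bit words

-- Placeholder permutations; the real permP/permQ are table-driven.
permP, permQ :: State -> State
permP = V.map (`xor` 0x01)
permQ = V.map (`xor` 0x02)

-- Grøstl compression: f(h, m) = P(h `xor` m) `xor` Q(m) `xor` h
f512 :: State -> State -> State
f512 h m = V.zipWith3 (\a b c -> a `xor` b `xor` c) h outP outQ
  where outP = permP (V.zipWith xor h m)
        outQ = permQ m

-- Merkle-Damgård: fold the compression function over all blocks
groestlCore :: State -> [State] -> State
groestlCore = foldl f512
```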

Q1) The first thing that bothers me is that the use of mutable vectors makes my code look really ugly. This is my first time writing any major mutable code in Haskell, so I don't really know how to improve this. Any tips on how I might better structure the monadic code would be welcome.
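One pattern that tends to keep mutable Haskell readable: break each mutation into a small named ST helper and expose only a pure wrapper around runST. The helpers below (xorInto, addRoundConst, transform) are hypothetical illustrations of the shape, not functions from the repo:

```haskell
import Control.Monad.ST (ST, runST)
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- Small, named mutation steps instead of one large do-block.
xorInto :: M.STVector s Word64 -> V.Vector Word64 -> ST s ()
xorInto mv v = V.imapM_ (\i x -> M.modify mv (`xor` x) i) v

addRoundConst :: M.STVector s Word64 -> Word64 -> ST s ()
addRoundConst mv c = M.modify mv (`xor` c) 0

-- The only entry point is pure: thaw a copy, mutate, freeze.
transform :: Word64 -> V.Vector Word64 -> V.Vector Word64
transform c v = runST $ do
  mv <- V.thaw v          -- V.thaw copies, so v itself is untouched
  addRoundConst mv c
  xorInto mv v
  V.freeze mv
```

The point is that the do-block at the bottom reads as a sequence of algorithm steps, with all the index arithmetic hidden inside the helpers.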

Q2) The second is performance. Actually, it's not too bad: at the moment the Haskell code is only 3 times slower. Using GHC 7.2.1 and compiling as such:

ghc -O2 -Odph -fllvm -optlo-O3 -optlo-loop-reduce -optlo-loop-deletion

the Haskell code takes 60 s on an input of ~1 GB, while the C version takes 21-22 s. But there are some things I find odd:

(1) If I try to inline rnd512QM, the code takes 4 times longer, but if I inline rnd512PM nothing happens! Why is this happening? These two functions are virtually identical!
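One way to investigate this is to take the decision away from GHC's heuristics with explicit pragmas, then compare the Core output via -ddump-simpl or -ddump-simpl-stats for the two variants. A tiny stand-alone illustration (step1/step2 are dummy stand-ins for the round functions, not the real rnd512PM/rnd512QM):

```haskell
-- Pin the inlining decision explicitly, then recompile with
--   ghc -O2 -ddump-simpl-stats Demo.hs
-- once with NOINLINE and once with INLINE, and diff the dumps.
step1, step2 :: Int -> Int
step1 x = x * 3 + 1
step2 x = x * 3 + 2
{-# NOINLINE step1 #-}
{-# NOINLINE step2 #-}

run :: Int -> Int
run = step2 . step1
```

If the Core for the two "identical" functions differs after optimization, the dump usually shows which heuristic (size threshold, loop-breaker selection) fired differently.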

(2) This is maybe more difficult. I've been experimenting with executing the two permutations in parallel. But currently to no avail. This is one example of what I tried:

f512 h m = V.force outP `par` (V.force outQ `pseq` (V.zipWith3 xor3 h outP outQ))
   where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
         inP = V.zipWith xor h m
         outP = permP inP
         outQ = permQ m

When checking the run-time statistics and using ThreadScope, I noticed that the correct number of SPARKS was created, but almost none were actually converted to useful parallel work. Thus I gained no speedup. My questions then become:

  1. Are the P and Q functions just too small for the runtime to bother to run in parallel?
  2. If not, is my use of par and pseq (and possibly Vector.Unboxed.force) wrong?
  3. Would I gain anything by switching to strategies? And how would I go about doing that?
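Regarding question 3, the Eval monad from Control.Parallel.Strategies expresses the same spark-then-wait pattern a bit more explicitly than raw par/pseq. A sketch using the same placeholder permutations as before (substitute the real table-driven permP/permQ):

```haskell
import Control.Parallel.Strategies (runEval, rpar, rseq)
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V

type State = V.Vector Word64

-- Placeholders; the real permP/permQ are the table-driven versions.
permP, permQ :: State -> State
permP = V.map (`xor` 0x01)
permQ = V.map (`xor` 0x02)

-- Spark P, evaluate Q locally, then wait for P before combining.
-- WHNF (rseq) is enough here: forcing an unboxed vector to WHNF
-- computes all of its elements.
f512Par :: State -> State -> State
f512Par h m = runEval $ do
  outP <- rpar (permP (V.zipWith xor h m))
  outQ <- rseq (permQ m)
  _    <- rseq outP
  return (V.zipWith3 (\a b c -> a `xor` b `xor` c) h outP outQ)
```

Whether the sparks actually convert still depends on each permutation doing enough work to pay for the spark overhead, so this restructuring alone may not change the ThreadScope picture.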

Thank you so much for your time.

EDIT:

Sorry for not providing any real benchmark tests. The testing code in the repo was intended for my own use only. For those wanting to test the code, you will need to compile main.hs and then run it as:

./main "algorithm" "testvariant" "byte aligned"

For instance:

./main groestl short224 False

or

./main groestl e False

(e stands for "Extreme". It's the very long message provided with the NIST KATs.)

asked Nov 16 '11 by hakoja

2 Answers

I checked out the repo, but there's no simple benchmark to just run and play with, so my ideas are just from eyeballing the code. Numbering is unrelated to your questions.

1) I'm pretty sure force doesn't do what you want -- it actually forces a copy of the underlying vector.

2) I think the use of unsafeThaw and unsafeFreeze is sort of odd. I'd just put f512M in the ST monad and be done with it. Then run it something like so:

otherwise = \msg -> truncate G224 . outputTransformation . runST $ foldM f512M h0_224 (parseMessage dataBitLen 512 msg)

3) V.foldM' is sort of silly -- you can just use a normal (strict) foldM over a list -- folding over the vector in the second argument doesn't seem to buy anything.
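A strict foldM over a plain list is also easy to hand-roll if you want to be certain the accumulator is forced at every block (foldM' here is a local helper, not the library function):

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (runST)

-- Strict monadic left fold: the bang forces the accumulator at each
-- step, so no thunk chain builds up across message blocks.
foldM' :: Monad m => (a -> b -> m a) -> a -> [b] -> m a
foldM' _ !z []     = return z
foldM' f !z (x:xs) = f z x >>= \z' -> foldM' f z' xs

-- Tiny usage example in ST; in the real code the step would be f512M.
sumST :: [Int] -> Int
sumST xs = runST (foldM' (\acc x -> return (acc + x)) 0 xs)
```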

4) I'm dubious about the bangs in columnM and for the unsafeReads.

Also...

a) I suspect that xoring unboxed vectors can probably be implemented at a lower level than zipWith, making use of Data.Vector internals.

b) However, it may be better not to do this as it could interfere with vector fusion.

c) On inspection, extractByte looks slightly inefficient? Rather than using fromIntegral to truncate, maybe use mod or quot and then a single fromIntegral to take you directly to an Int.
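Concretely, the extraction can be a single shift, a mask, and one fromIntegral at the end. The signature below is an assumption about what extractByte looks like; adjust the index direction to the repo's actual byte ordering:

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64)

-- Extract byte n (0 = least significant) of a 64-bit word with one
-- shift, one mask, and a single fromIntegral straight to Int.
extractByte :: Int -> Word64 -> Int
extractByte n w = fromIntegral ((w `shiftR` (8 * n)) .&. 0xff)
```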

answered Oct 15 '22 by sclv

  1. Be sure to compile with -threaded -rtsopts and execute with +RTS -N2. Without that, you won't have more than one OS thread to perform computations.

  2. Try to spark computations that are referred to elsewhere, otherwise they might be collected:


f512 h m = outP `par` (outQ `pseq` (V.zipWith3 xor3 h outP outQ))
   where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
         inP = V.zipWith xor h m
         outP = V.force $ permP inP
         outQ = V.force $ permQ m


  3. If you switch things up so parseBlock accepts strict bytestrings (or chunks and packs lazy ones when needed), then you can use Data.Vector.Storable and potentially avoid some copying.
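The zero-copy conversion is possible because a strict ByteString and a storable vector are both a ForeignPtr plus an offset and a length. A sketch (it uses bytestring's internal module, so it is tied to that representation; reading Word64s out of the result would additionally need alignment and endianness care):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import qualified Data.Vector.Storable as S
import Data.Word (Word8)

-- Reinterpret a strict ByteString as a storable vector of bytes
-- without copying the underlying buffer.
byteStringToVector :: B.ByteString -> S.Vector Word8
byteStringToVector bs = S.unsafeFromForeignPtr fp off len
  where (fp, off, len) = BI.toForeignPtr bs
```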

answered Oct 15 '22 by Thomas M. DuBuisson