I'm writing a toy implementation of a rainbow table in Haskell. The main data structure is a strict `Map h c`, containing a large number of pairs, generated from random values `c`:
import qualified Data.Map.Strict as M
import System.Random
table :: (Ord h, RandomGen g, Random c) => Int -> g -> M.Map h c
table n = M.fromList . map (\c -> (chain c, c)) . take n . randoms
where `chain` is very expensive to compute. The part that dominates the computation time is embarrassingly parallel, so I would expect a quasi-linear speedup in the number of cores when it runs in parallel.
However, I would like the computed pairs to be added to the table straight away, rather than accumulated in a list in memory. It should be noted that collisions may occur, and in that case the redundant chains should be dropped as soon as possible. Heap profiling of the sequential version confirms this behavior.
I've found `parMap` from `Control.Parallel.Strategies` and tried to apply it to my table-building function:

table n = M.fromList . parMap (evalTuple2 rseq rseq) (\c -> (chain c, c)) . take n . randoms

but, running with `+RTS -N`, I get to about 1.3 cores of usage at best. Heap profiling indicates, at least, that the intermediate list does not reside in memory, but `+RTS -s` also reports that 0 sparks were created. How is this possible with my usage of `parMap`? What is the proper way to do this?
EDIT: `chain` is defined as:

chain :: (c -> h) -> [h -> c] -> c -> h
chain h = (h .) . flip (foldl' (flip (. h)))

where `(c -> h)` is the target hash function, from cleartext to hash, and `[h -> c]` is a family of reducer functions. I want the implementation to stay generic over `c` and `h`, but for benchmarking I use strict bytestrings for both.
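In pointful style, that definition is equivalent to the following sketch (`chain'` is just a renamed copy for illustration):

import Data.List (foldl')

-- Starting from a cleartext c0, alternately hash and reduce along the
-- chain, and finish with a final application of the hash function.
chain' :: (c -> h) -> [h -> c] -> c -> h
chain' hash reducers c0 = hash (foldl' (\c reduce -> reduce (hash c)) c0 reducers)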
Here is what I came up with. Let me know how the benchmarks work out:
#!/usr/bin/env stack
{- stack --resolver lts-14.1 script --optimize
--package scheduler
--package containers
--package random
--package splitmix
--package deepseq
-}
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq
import Control.Scheduler
import Data.Foldable as F
import Data.IORef
import Data.List (unfoldr)
import Data.Map.Strict as M
import System.Environment (getArgs)
import System.Random as R
import System.Random.SplitMix
-- for simplicity
chain :: Show a => a -> String
chain = show
-- Sequential construction: a strict accumulator loop that threads the
-- generator through and inserts each (chain c, c) pair as soon as it is
-- computed, so redundant chains never accumulate in a list.
makeTable :: Int -> SMGen -> (SMGen, M.Map String Int)
makeTable = go M.empty
  where
    go !acc i gen
      | i > 0 =
          let (c, gen') = R.random gen
           in go (M.insert (chain c) c acc) (i - 1) gen'
      | otherwise = (gen, acc)
-- Parallel construction: split the initial generator into one independent
-- generator per worker, let each worker build a partial table with the
-- sequential loop above, and merge the partial tables at the end.
makeTablePar :: Int -> SMGen -> IO (M.Map String Int)
makeTablePar n0 gen0 = do
  let gens = unfoldr (Just . splitSMGen) gen0
  gensState <- initWorkerStates Par (\(WorkerId wid) -> newIORef (gens !! wid))
  tables <-
    withSchedulerWS gensState $ \scheduler -> do
      let k = numWorkers (unwrapSchedulerWS scheduler)
          (q, r) = n0 `quotRem` k
      -- The n0 elements are split into k jobs of q elements each, plus one
      -- extra job for the remainder r.
      forM_ ((if r == 0 then [] else [r]) ++ replicate k q) $ \n ->
        scheduleWorkState scheduler $ \genRef -> do
          gen <- readIORef genRef
          let (gen', table) = makeTable n gen
          writeIORef genRef gen'
          table `deepseq` pure table
  pure $ F.foldl' M.union M.empty tables
main :: IO ()
main = do
  [n] <- fmap read <$> getArgs
  gen <- initSMGen
  print =<< makeTablePar n gen
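To try it out (assuming the script is saved as MakeTable.hs and made executable), run something like `./MakeTable.hs 100000`; the stack header takes care of the dependencies and, thanks to `--optimize`, compiles the script with optimizations before running it.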
A few notes on the implementation:

- Don't use `random`, it is hella slow; `splitmix` is about x200 faster.
- In `makeTable`, if you want duplicate results to be discarded right away, then a manual loop or unfold is required. But since we need the generator returned, I opted for the manual loop.
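A side note on collision handling: `M.insert` replaces an existing entry, so the table keeps the last chain generated for a given hash. If you would rather keep the first one, `M.insertWith` from `Data.Map.Strict` can express that; a minimal sketch (`insertKeepFirst` is a hypothetical helper, not part of the script above):

import qualified Data.Map.Strict as M

-- insertWith applies its combining function to the new and the old value
-- when the key is already present; always returning the old value keeps
-- the first chain seen for a given hash.
insertKeepFirst :: Ord k => k -> v -> M.Map k v -> M.Map k v
insertKeepFirst = M.insertWith (\_new old -> old)

Either way a colliding chain is redundant and can safely be dropped.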