Repa 3 performance and correct usage of 'now'

There is a basic monad question in here, unrelated to Repa, plus several Repa-specific questions.

I am working on a library using Repa 3, and I am having trouble getting efficient parallel code. If I make my functions return delayed arrays, I get excruciatingly slow code that scales very well up to 8 cores: it uses over 20 GB of memory according to the GHC profiler, and runs several orders of magnitude slower than basic Haskell unboxed vectors.

Alternatively, if I make all of my functions return Unboxed manifest arrays (still attempting to use fusion within the functions, for example when I do a 'map'), I get MUCH faster code (still slower than using Haskell unboxed vectors) that doesn't scale at all, and in fact tends to get slightly slower with more cores.

Based on the FFT example code in repa-algorithms, it seems the correct approach is to always return manifest arrays. Is there ever a case where I should be returning delayed arrays?

The FFT code also makes plentiful use of the 'now' function. However, I get a type error when I try to use it in my code:

type Arr t r = Array t DIM1 r
data CycRingRepa m r = CRTBasis (Arr U r)
                     | PowBasis (Arr U r)

fromArray :: forall m r t. (BaseRing m r, Unbox r, Repr t r) => Arr t r -> CycRingRepa m r
fromArray = 
    let mval = reflectNum (Proxy::Proxy m)
    in \x ->
        let sh:.n = extent x
        in assert (mval == 2*n) PowBasis $ now $ computeUnboxedP $ bitrev x

The code compiles fine without the 'now'. With the 'now', I get the following error:

Couldn't match type `r' with `Array U (Z :. Int) r'
  `r' is a rigid type variable bound by
      the type signature for
        fromArray :: (BaseRing m r, Unbox r, Repr t r) =>
                     Arr t r -> CycRingRepa m r
      at C:\Users\crockeea\Documents\Code\LatticeLib\CycRingRepa.hs:50:1
Expected type: CycRingRepa m r
  Actual type: CycRingRepa m (Array U DIM1 r)

I don't think this is my problem, though. It would be helpful if someone could explain how the monad in 'now' works. By my best estimation, the monad seems to be creating an 'Arr U (Arr U r)'. I'm expecting an 'Arr U r', which would match the data constructor pattern. What is going on, and how do I fix this?

The type signatures are:

computeUnboxedP :: Fill r1 U sh e => Array r1 sh e -> Array U sh e
now :: (Shape sh, Repr r e, Monad m) => Array r sh e -> m (Array r sh e)

It would be helpful to have a better idea of when it is appropriate to use 'now'.

A couple of other Repa questions:

- Should I explicitly call computeUnboxedP (as in the FFT example code), or should I use the more general computeP (since the Unboxed representation is implied by my data type)?
- Should I store delayed or manifest arrays in the data type CycRingRepa?
- Eventually I would also like this code to work with Haskell Integers. Will that require new code that uses something other than U arrays, or can I write polymorphic code that creates U arrays for Unbox types and some other representation for Integers/boxed types?

I realize there are a lot of questions in here, and I appreciate any/all answers!

asked Mar 15 '12 by crockeea

2 Answers

Here's the source code for now:

now arr = do
  arr `deepSeqArray` return ()
  return arr

So it's really just a monadic version of deepSeqArray. You can use either of these to force evaluation, rather than hanging on to a thunk. This "evaluation" is different from the "computation" forced when computeP is called.
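To make the distinction concrete, here is a minimal sketch (firstElem and firstElemM are made-up names; now, deepSeqArray, and (!) are standard Repa 3 functions):

import Data.Array.Repa

-- Pure code: deepSeqArray forces the whole array before its result is used.
firstElem :: Array U DIM1 Double -> Double
firstElem arr = arr `deepSeqArray` (arr ! (Z :. 0))

-- Monadic code: now performs the same forcing, sequenced by the monad.
firstElemM :: Monad m => Array U DIM1 Double -> m Double
firstElemM arr = do
  arr' <- now arr   -- arr' is fully evaluated from here on
  return (arr' ! (Z :. 0))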

In your code, now doesn't apply, since you're not in a monad. But in this context deepSeqArray wouldn't help either.
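As for the type error itself, here is a sketch of the unification that produces it, reconstructed from the error message quoted above:

-- PowBasis expects an argument of type:      Arr U r  =  Array U DIM1 r
-- now $ computeUnboxedP $ bitrev x has type: m (Array U DIM1 r)
--
-- Unifying  Array U DIM1 r'  with  m (Array U DIM1 r)  takes
--   m  := Array U DIM1    (Array U DIM1 has kind * -> *)
--   r' := Array U DIM1 r
-- which is exactly the "Actual type: CycRingRepa m (Array U DIM1 r)"
-- that GHC reports. Dropping 'now', or binding its result with <- inside
-- a do block, avoids the mismatch.

Consider this situation: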

x :: Array U DIM1 Double
x = ...

y :: Array U DIM1 Double
y = computeUnboxedP $ map f x

Since y refers to x, we'd like to be sure x is computed before starting to compute y. If not, the available work won't be distributed correctly among the gang of threads. To get this to work out, it's better to write y as

y = deepSeqArray x . computeUnboxedP $ map f x

Now, for a delayed array, we have

deepSeqArray (ADelayed sh f) y = sh `deepSeq` f `seq` y

Rather than computing all the elements, this just makes sure the shape is computed, and reduces f to weak head normal form.

As for manifest vs. delayed arrays, there are certainly times when delayed arrays are preferable. For example:

multiplyMM arr brr
 = [arr, brr] `deepSeqArrays`
   A.sumP (A.zipWith (*) arrRepl brrRepl)
 where  -- force the transpose so both operands are traversed row-major
        trr             = computeUnboxedP $ transpose2D brr
        -- extend replicates each operand across one new dimension, so the
        -- zipWith/sumP below fuses into a single triple loop
        arrRepl         = trr `deepSeqArray` A.extend (Z :. All   :. colsB :. All) arr
        brrRepl         = trr `deepSeqArray` A.extend (Z :. rowsA :. All   :. All) trr
        (Z :. rowsA :. _    ) = extent arr
        (Z :. _     :. colsB) = extent brr

Here "extend" generates a new array by copying the values across some set of new dimensions. In particular, this means that

arrRepl ! (Z :. i :. j :. k) == arrRepl ! (Z :. i :. j' :. k)

Thankfully, extend produces a delayed array, since it would be a waste to go through the trouble of all this copying.

Delayed arrays also allow the possibility of fusion, which is impossible if the array is manifest.
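For instance (a minimal sketch; pipeline is a made-up name, while map and computeUnboxedS are standard Repa 3 functions):

import Data.Array.Repa as R

-- Both maps return delayed arrays, so the two traversals fuse:
-- no intermediate array is ever materialized.
pipeline :: Array U DIM1 Double -> Array U DIM1 Double
pipeline = computeUnboxedS . R.map (* 2) . R.map (+ 1)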

Finally, computeUnboxedP is just computeP with a specialized type. Calling computeUnboxedP explicitly may help GHC optimize, and it makes the code a little clearer.
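In other words, with the Repa 3.0 signatures quoted in the question, the following two definitions are equivalent (a sketch; f and xs are placeholders):

ys1, ys2 :: Array U DIM1 Double
ys1 = computeUnboxedP (map f xs)
ys2 = computeP (map f xs)   -- result representation pinned by the signature instead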

answered by Chad Scherrer


Repa 3.1 no longer requires the explicit use of now. The parallel computation functions are all monadic, and automatically apply deepSeqArray to their results. The repa-examples package also contains a new implementation of matrix multiply that demonstrates their use.
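A minimal sketch of the 3.1 style (running computeP in the Identity monad is one common way to sequence a pure parallel computation; step is a made-up name):

import Data.Array.Repa as R
import Control.Monad.Identity (runIdentity)

step :: Array U DIM1 Double -> Array U DIM1 Double
step xs = runIdentity $ computeP (R.map (+ 1) xs)   -- computeP is monadic in 3.1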

answered by Ben Lippmeier