The Haskell repa library is for automatically parallel array computation on CPUs. The accelerate library is automatic data parallelism on GPUs. The APIs are quite similar, with identical representations of N-dimensional arrays. One can even switch between accelerate and repa arrays with fromRepa
and toRepa
in Data.Array.Accelerate.IO
:
fromRepa :: (Shapes sh sh', Elt e) => Array A sh e -> Array sh' e
toRepa :: Shapes sh sh' => Array sh' e -> Array A sh e
There are multiple backends for accelerate, including LLVM, CUDA and FPGA (see Figure 2 of http://www.cse.unsw.edu.au/~keller/Papers/acc-cuda.pdf). I've spotted a repa backend for accelerate, though the library doesn't appear to be maintained. Given that the repa and accelerate programming models are similar, I am hopeful that there is an elegant way of switching between them i.e. functions written once can be executed with repa's R.computeP or with one of accelerate's backends e.g. with the CUDA run function.
Take a simple image processing thresholding function. If a grayscale pixel value is less than 50, then it is set to 0, otherwise it retains its value. Here's what it does to a pumpkin:
The following code presents repa and accelerate implementations:
module Main where
import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.IO.BMP as R
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.IO as A
import qualified Data.Array.Accelerate.Interpreter as A
import Data.Word
-- Apply threshold over image using accelerate (interpreter)
thresholdAccelerate :: IO ()
thresholdAccelerate = do
img <- either (error . show) id `fmap` A.readImageFromBMP "pumpkin-in.bmp"
let newImg = A.run $ A.map evalPixel (A.use img)
A.writeImageToBMP "pumpkin-out.bmp" newImg
where
-- *** Exception: Prelude.Ord.compare applied to EDSL types
evalPixel :: A.Exp A.Word32 -> A.Exp A.Word32
evalPixel p = if p > 50 then p else 0
-- Apply threshold over image using repa
thresholdRepa :: IO ()
thresholdRepa = do
let arr :: IO (R.Array R.U R.DIM2 (Word8,Word8,Word8))
arr = either (error . show) id `fmap` R.readImageFromBMP "pumpkin-in.bmp"
img <- arr
newImg <- R.computeP (R.map applyAtPoint img)
R.writeImageToBMP "pumpkin-out.bmp" newImg
where
applyAtPoint :: (Word8,Word8,Word8) -> (Word8,Word8,Word8)
applyAtPoint (r,g,b) =
let [r',g',b'] = map applyThresholdOnPixel [r,g,b]
in (r',g',b')
applyThresholdOnPixel x = if x > 50 then x else 0
data BackendChoice = Repa | Accelerate
main :: IO ()
main = do
let userChoice = Repa -- pretend this command line flag
case userChoice of
Repa -> thresholdRepa
Accelerate -> thresholdAccelerate
The implementations of thresholdAccelerate
and thresholdRepa
are very similar. Is there an elegant way to write array processing functions once, then opt for multicore CPUs (repa) or GPUs (accelerate) in a switch programmatically? I can think of choosing my import in accordance with whether I want CPU or GPU i.e. to import either Data.Array.Accelerate.CUDA
or Data.Array.Repa
to execute an action of type Acc a
with:
run :: Arrays a => Acc a -> a
Or, to use a type class e.g. something roughly like:
main :: IO ()
main = do
let userChoice = Repa -- pretend this is a command line flag
action <- case userChoice of
Repa -> applyThreshold :: RepaBackend ()
Accelerate -> applyThreshold :: CudaBackend ()
action
Or is it the case that, for each parallel array function I wish to express for both CPUs and GPUs, I must implement it twice --- once with the repa library and again with the accelerate library?
The short answer is that, at the moment, you unfortunately need to write both versions.
However, we are working on CPU support for Accelerate, which will obviate the need for the Repa version of the code. In particular, Accelerate very recently gained a new LLVM-based backend that targets both GPUs and CPUs: https://github.com/AccelerateHS/accelerate-llvm
This new backend is still incomplete, buggy, and experimental, but we are planning to make it into a viable alternative to the current CUDA backend.
I thought about this a year and few months ago while designing yarr
. At that time there were serious issues with type families inference or something like this (I don't remember exactly) which prevented to implement such unifying wrapper of vector
, repa
, yarr
, accelerate
, etc. both efficiently and allowing not to write too many explicit type signatures, or implement it in principle (I don't remember).
That was GHC 7.6. I don't know if there meaningful improvements in GHC 7.8 in this field. Theoretically I didn't see any problems, thus we can expect such stuff someday, in short or long time, when GHC will be ready.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With