I'm experimenting with the C++ AMP library in F# as a way of using the GPU to do work in parallel. However, the results I'm getting don't seem intuitive.
In C++, I made a library with one function that squares all the numbers in an array, using AMP:
#include <amp.h>
using namespace concurrency;

extern "C" __declspec ( dllexport ) void _stdcall square_array(double* arr, int n)
{
    // Create a view over the data on the CPU
    array_view<double,1> dataView(n, &arr[0]);
    // Run code on the GPU
    parallel_for_each(dataView.extent, [=] (index<1> idx) restrict(amp)
    {
        dataView[idx] = dataView[idx] * dataView[idx];
    });
    // Copy data from GPU to CPU
    dataView.synchronize();
}
(Code adapted from Igor Ostrovsky's blog on MSDN.)
I then wrote the following F# to compare the Task Parallel Library (TPL) to AMP:
open System
open System.Diagnostics
open System.Runtime.InteropServices
open System.Threading.Tasks

// Print the time needed to run the given function
let time f =
    let s = new Stopwatch()
    s.Start()
    f ()
    s.Stop()
    printfn "elapsed: %d" s.ElapsedTicks

module CInterop =
    [<DllImport("CPlus", CallingConvention = CallingConvention.StdCall)>]
    extern void square_array(float[] array, int length)

let options = new ParallelOptions()
let size = 1000.0
let arr = [|1.0 .. size|]

// Square the number at the given index of the array
let sq i =
    arr.[i] <- arr.[i] * arr.[i]

// Square every number in the array using TPL
// (the upper bound of Parallel.For is exclusive, so pass arr.Length)
time (fun () -> Parallel.For(0, arr.Length, options, new Action<int>(sq)) |> ignore)

let arr2 = [|1.0 .. size|]
// Square every number in the array using AMP
time (fun () -> CInterop.square_array(arr2, arr2.Length))
If I set the array size to a trivial number like 10, it takes the TPL ~22K ticks to finish, and AMP ~10K ticks. That's what I expect. As I understand it, a GPU (hence AMP) should be better suited to this situation, where the work is broken into very small pieces, than the TPL.
However, if I increase the array size to 1000, the TPL now takes ~30K ticks and AMP takes ~70K ticks. And it just gets worse from there. For an array of size 1 million, AMP takes nearly 1000x as long as the TPL.
Since I expect the GPU (i.e. AMP) to be better at this kind of task, I'm wondering what I'm missing here.
My graphics card is a GeForce 550 Ti with 1GB, not a slouch as far as I know. I know there's overhead in using PInvoke to call into the AMP code, but I expect that to be a flat cost that is amortized over larger array sizes. I believe the array is passed by reference (though I could be wrong), so I don't expect any cost associated with copying that.
Thank you to everyone for your advice.
Transferring data back and forth between the GPU and the CPU takes time. You are most likely measuring your PCI Express bus bandwidth here. Squaring 1M floats is a piece of cake for a GPU.
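To put rough numbers on that, here is a back-of-the-envelope sketch in standard C++ (no AMP required). The function names and the constant are made up for illustration. The 8 GB/s figure is an assumption: it is the nominal per-direction peak of a PCIe 2.0 x16 link, and real sustained throughput will be lower. The estimate also ignores fixed per-call costs such as driver and runtime overhead, so treat it as a lower bound on transfer time only.

```cpp
#include <cassert>
#include <cstddef>

// Bytes that cross the bus when n elements are copied to the GPU and back.
inline double round_trip_bytes(std::size_t n, std::size_t element_size) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(element_size);
}

// Idealized transfer time in microseconds at a sustained bandwidth
// given in bytes per second. Latency and driver overhead are ignored.
inline double transfer_micros(double bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second * 1e6;
}

// ASSUMPTION: nominal PCIe 2.0 x16 peak, per direction. Hypothetical
// constant for illustration; measure your own system for real numbers.
const double kAssumedBandwidth = 8.0e9;
```

For one million doubles, `round_trip_bytes(1000000, sizeof(double))` is 16 MB, and `transfer_micros` at the assumed bandwidth comes to roughly 2000 microseconds. That is about 2 ms spent on the bus for a single multiplication per element, which is why a computation this light cannot pay for the transfer.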
It's also not a good idea to use the Stopwatch class to measure performance for AMP, because GPU calls can happen asynchronously. In your case it is OK, because the timed call ends with dataView.synchronize(), which blocks until the results are back on the CPU. But if you measure the compute part only (the parallel_for_each), this won't work. I think you can use D3D11 performance counters for that.
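The timing pitfall can be illustrated without a GPU. In the sketch below, `std::async` stands in for an asynchronous GPU launch and a sleep stands in for the kernel; this is an analogy in standard C++, not AMP code, and `Timing` and `time_async_work` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

struct Timing {
    double launch_ms;    // time until the asynchronous call returned
    double complete_ms;  // time until the work actually finished
};

// Times a stand-in asynchronous job (a sleep of `work` duration).
inline Timing time_async_work(std::chrono::milliseconds work) {
    assert(work.count() >= 0);
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto start = clock::now();
    std::future<void> job = std::async(std::launch::async, [work] {
        std::this_thread::sleep_for(work);  // pretend this is the GPU kernel
    });
    const auto launched = clock::now();  // stopping here measures only the launch

    job.wait();  // analogous to waiting for the GPU to finish
    const auto finished = clock::now();

    return Timing{ms(launched - start).count(), ms(finished - start).count()};
}
```

Calling `time_async_work(std::chrono::milliseconds(50))` should report a `launch_ms` far below 50 and a `complete_ms` of at least 50. A stopwatch stopped right after the launch sees only the first number. In the question's code, `dataView.synchronize()` plays the role of `job.wait()`, which is why timing the whole call is still meaningful.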