I'm having trouble processing large arrays (more than 65536 elements) in C++ AMP. I'm using C++ AMP to calculate the normal, tangent and bitangent vectors for a list of polygons. The input consists of an array of positions (3 floats per position) and an array of uv-coordinates (2 floats per vertex). In my parallel_for_each kernel, I compute the normals, tangents and bitangents (one of each per group of 3 vertices) and write them back to arrays (encapsulated in array_views). The algorithm looks as follows:
concurrency::extent<2> ePositions(positionsVector.size() / 3, 3);
concurrency::array_view<const float, 2> positions(ePositions, positionsVector);
concurrency::extent<2> eUVs(uvsVector.size() / 2, 2);
concurrency::array_view<const float, 2> UVs(eUVs, uvsVector);
concurrency::extent<2> eNormalDirections(normalDirectionsVector.size() / 3, 3);
concurrency::array_view<float, 2> normalDirections(eNormalDirections, normalDirectionsVector);
normalDirections.discard_data();
concurrency::extent<2> eTangentDirections(tangentDirectionsVector.size() / 3, 3);
concurrency::array_view<float, 2> tangentDirections(eTangentDirections, tangentDirectionsVector);
tangentDirections.discard_data();
concurrency::extent<2> eBitangentDirections(bitangentDirectionsVector.size() / 3, 3);
concurrency::array_view<float, 2> bitangentDirections(eBitangentDirections, bitangentDirectionsVector);
bitangentDirections.discard_data();
concurrency::parallel_for_each(eNormalDirections.tile<1, 3>(), [=](concurrency::tiled_index<1, 3> t_idx) restrict(amp)
{
< ... calculate the normals, tangents and bitangents and write them back ... >
});
normalDirections.synchronize();
tangentDirections.synchronize();
bitangentDirections.synchronize();
The original data is contained in positionsVector and uvsVector. The output is stored in normalDirectionsVector, tangentDirectionsVector and bitangentDirectionsVector. Three positions (and their associated uv-pairs) form one polygon. As only one normal, one tangent and one bitangent are needed per polygon, the output vectors are three times smaller than the input vectors. All vectors are encapsulated in array_views in the first code block.
The algorithm works fine as long as the number of normals to calculate is smaller than 65536. As soon as I need 65536 or more normals, I get the following exception:
concurrency::parallel_for_each (tiling): unsupported compute domain, the extent of dimension 0 of the compute domain (65536) exceeds the exclusive limit (65536)
As the geometry I'd like to process consists of more than 65536 polygons, this limitation is a problem for me. I can't imagine C++ AMP is limited to processing fewer than 65536 elements. I'd therefore like to know what mistake I'm making in my approach, and how I can process arrays of more than 65536 elements.
Most GPUs have at least a GB of global memory, and both array and array_view store their data in global memory. In the case of array_view, this data is automatically synchronized with the data in host (CPU) memory. GPUs also have tile_static memory, which is much more limited. In this case I don't believe you are running into any memory-related limits.
The compute domain is the extent passed to the parallel_for_each, and it describes the number of threads being launched on the GPU. A GPU can only execute a limited number of threads in a single dispatch, and this is the limit you have hit, as described in the error message. Altering the number of dimensions of the compute domain will not solve your issue; it is the total number of threads you launch in one call that is the problem, regardless of how they are arranged. This is a general limitation of the GPU hardware (you will find similar limits with CUDA as well).
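To make the numbers in the error message concrete, here is a minimal sketch of the compute domain created by the code in the question (assuming, for illustration, exactly 65536 polygons, i.e. positionsVector.size() == 65536 * 3):
#include <amp.h>
// The output extent is (number of polygons) x 3, tiled into tiles of 1 x 3 threads.
concurrency::extent<2> eNormalDirections(65536, 3);
auto tiledDomain = eNormalDirections.tile<1, 3>();
// Tiles along dimension 0: 65536 / 1 = 65536, which reaches the exclusive limit of
// 65536 reported in the exception (the value must be strictly smaller than 65536).
// Tiles along dimension 1: 3 / 3 = 1, which is fine.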
You have a couple of approaches to fixing this issue.
1) Break your calculation down into chunks that are each smaller than the dispatch limit. This may have the additional advantage of letting you overlap each chunk's copy overhead with the previous chunk's compute.
2) Have each thread in the compute domain calculate results for more than one polygon (see the sketch after this list). This increases the amount of work done by each thread, which may also improve the efficiency of the overall algorithm if it is actually constrained by the data transfers.
3) A combination of 1 & 2.
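Here is a minimal sketch of option 2, assuming the same array_views as in the question and a hypothetical compute_polygon() helper that contains the per-polygon math from the original kernel. Each thread now processes polysPerThread consecutive polygons, which shrinks dimension 0 of the compute domain by the same factor:
#include <amp.h>
const int polysPerThread = 4; // chosen so that numThreads stays below the 65536 limit
const int numPolygons = static_cast<int>(normalDirectionsVector.size() / 3);
const int numThreads = (numPolygons + polysPerThread - 1) / polysPerThread;
concurrency::parallel_for_each(concurrency::extent<1>(numThreads),
    [=](concurrency::index<1> idx) restrict(amp)
{
    for (int i = 0; i < polysPerThread; ++i)
    {
        const int polygon = idx[0] * polysPerThread + i;
        if (polygon >= numPolygons)
            break; // the last thread may have fewer polygons to process
        // compute_polygon(polygon, positions, UVs,
        //                 normalDirections, tangentDirections, bitangentDirections);
    }
});
For option 1, the same idea applies on the host side: issue several parallel_for_each calls, each over an array_view::section() of the data that is small enough to dispatch, instead of one large launch.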