Each GPU device (AMD, NVIDIA, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So one has (Compute Units) x (StreamProcessors per Compute Unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16KB or 32KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters.
Now my questions:

1. Is the number of multiprocessors on a device the value reported by CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
2. How can I find out how much __local memory per MP is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how I can deduce it from even an individual detailed specifications sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3.
3. Which card currently has the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64KB (or larger) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable, but also highly memory-intensive with a random access pattern within each MP (iterating over edges of graphs).

CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of Compute Units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the NVIDIA OpenCL programming guide).

For your usage of local memory, however, you should remember that local memory is allocated per workgroup (and is only accessible within that workgroup), while a Compute Unit can typically sustain more than one workgroup. So if your algorithm allocates the whole local memory to one workgroup, the Compute Unit can hold only that one workgroup at a time and you will not be able to achieve the maximum amount of parallelism. Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serializations. So your algorithm might not parallelize quite as well as you think it will (or maybe it will; I'm just mentioning the possibility).
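To make the point about local memory and workgroups concrete, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic: given the value you would get from CL_DEVICE_LOCAL_MEM_SIZE and the amount of __local memory one workgroup of your kernel allocates, it estimates how many workgroups can be resident on a Compute Unit at once. The function name and the cap of 8 resident workgroups are my own assumptions for illustration; the real cap depends on the hardware, and registers and workgroup size limit residency as well.

```python
def workgroups_per_compute_unit(local_mem_size, local_mem_per_workgroup,
                                hw_max_workgroups=8):
    """Estimate how many workgroups fit on one Compute Unit.

    local_mem_size          -- bytes, e.g. the CL_DEVICE_LOCAL_MEM_SIZE value
    local_mem_per_workgroup -- bytes of __local memory one workgroup allocates
    hw_max_workgroups       -- ASSUMED hardware cap on resident workgroups
                               (hypothetical default, check your device docs)
    """
    if local_mem_per_workgroup <= 0:
        # No __local memory used: only the (assumed) hardware cap applies.
        return hw_max_workgroups
    by_local_mem = local_mem_size // local_mem_per_workgroup
    return min(hw_max_workgroups, by_local_mem)


if __name__ == "__main__":
    KB = 1024
    device_local_mem = 32 * KB  # one of the typical sizes mentioned above
    for per_group in (32 * KB, 8 * KB, 2 * KB):
        n = workgroups_per_compute_unit(device_local_mem, per_group)
        print(f"{per_group // KB:2d} KB per workgroup -> {n} resident workgroup(s)")
```

With 32KB of local memory, a kernel that takes all 32KB leaves room for a single workgroup per Compute Unit, while 8KB per workgroup allows four. Fewer resident workgroups means less work available to hide memory latency.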
With a Fermi-based card, your best bet might be to count on the caches instead of explicit local memory, if all your workgroups operate on the same data (I don't know how to switch the L1/local memory configuration, though).
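Coming back to the bank conflicts mentioned earlier, a small simulation shows why a random access pattern hurts. This is only a toy model under stated assumptions: 32 banks, 4-byte words, consecutive words in consecutive banks, and reads of the same word by several work-items counted as one access (broadcast). Those numbers and the function name are mine for illustration; check the programming guide for your actual device.

```python
import random
from collections import defaultdict

NUM_BANKS = 32   # ASSUMPTION: number of local memory banks
WORD_BYTES = 4   # ASSUMPTION: bank width in bytes


def conflict_degree(byte_addresses):
    """Worst-case number of distinct words that map to the same bank.

    A result of 1 means conflict-free; a result of k means the access is
    serialized into roughly k steps in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


# 32 work-items reading consecutive 4-byte words: one word per bank.
linear = [WORD_BYTES * i for i in range(32)]

# 32 work-items reading with a stride of 32 words: every access hits bank 0.
strided = [WORD_BYTES * NUM_BANKS * i for i in range(32)]

# 32 work-items reading random words from a 32KB region (e.g. graph edges).
rng = random.Random(0)
scattered = [WORD_BYTES * rng.randrange(8192) for _ in range(32)]

if __name__ == "__main__":
    print("linear   :", conflict_degree(linear))
    print("strided  :", conflict_degree(strided))
    print("scattered:", conflict_degree(scattered))
```

The linear pattern is conflict-free and the strided one is the worst case. Random accesses typically land somewhere in between, which is the serialization cost to expect when iterating over graph edges in local memory.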