Shared memory is "striped" into banks. This leads to the whole issue of bank conflicts, as we all know.
Question: But how can you determine how many banks ("stripes") exist in shared memory?
(Poking around NVIDIA "devtalk" forums, it seems that per-block shared memory is "striped" into 16 banks. But how do we know this? The threads suggesting this are a few years old. Have things changed? Is it fixed on all NVIDIA CUDA-capable cards? Is there a way to determine this from the runtime API (I don't see it there, e.g. under cudaDeviceProp)? Is there a manual way to determine it at runtime?)
The warp size is 32 threads and the number of banks is also 32, so bank conflicts can occur between any threads in the warp.
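To illustrate (this sketch is not from the original thread, and the kernel and variable names are made up): assuming 32 four-byte banks, a stride-1 access within a warp is conflict-free, while a stride-2 access makes pairs of threads in the same warp hit the same bank, i.e. a 2-way bank conflict.

```
__global__ void strided_access(float *out)
{
    __shared__ float tile[64];

    int tid = threadIdx.x;

    tile[tid] = (float)tid;      // stride-1: each thread uses a different bank, no conflict
    __syncthreads();

    // stride-2: thread tid and thread tid+16 both map to bank (2*tid) % 32,
    // so every bank is accessed by two threads of the warp (2-way conflict)
    out[tid] = tile[tid * 2];
}

// launch with a single warp, e.g.: strided_access<<<1, 32>>>(d_out);
```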
Declare shared memory in CUDA C/C++ device code using the __shared__ variable declaration specifier. There are multiple ways to declare shared memory inside a kernel, depending on whether the amount of memory is known at compile time or at run time.
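As a minimal sketch of the two styles (illustrative names, no error checking): a statically sized array declared inside the kernel, and a dynamically sized one declared with `extern __shared__` whose size is passed as the third launch-configuration argument.

```
__global__ void static_shared(float *d_out)
{
    __shared__ float s[64];          // size known at compile time
    int t = threadIdx.x;
    s[t] = (float)t;
    __syncthreads();
    d_out[t] = s[64 - t - 1];        // read back in reverse order
}

__global__ void dynamic_shared(float *d_out, int n)
{
    extern __shared__ float s[];     // size supplied at kernel launch
    int t = threadIdx.x;
    if (t < n) s[t] = (float)t;
    __syncthreads();
    if (t < n) d_out[t] = s[n - t - 1];
}

// launches (assumed 64-element buffers):
// static_shared<<<1, 64>>>(d_out);
// dynamic_shared<<<1, 64, 64 * sizeof(float)>>>(d_out, 64);
```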
Memory management on a CUDA device is similar to how it is done in CPU programming. You need to allocate memory space on the device, transfer the data to the device using the built-in API, retrieve the results (transfer the data back to the host), and finally free the allocated memory.
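For context, here is a minimal sketch of that allocate / copy / copy-back / free cycle using the runtime API (error checking omitted, buffer size chosen arbitrarily):

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 256;
    size_t bytes = n * sizeof(float);

    float h_data[n];                     // host buffer
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = NULL;
    cudaMalloc(&d_data, bytes);          // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    // ... launch kernels that operate on d_data ...

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);                    // free device memory
    return 0;
}
```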
As @RobertHarvey says, it's documented. The programming guide indicates 16 banks for compute capability 1.x, and 32 banks for compute capability 2.x and 3.x. You can thus make any decisions based on the compute capability (major version) returned in device properties.
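A minimal sketch of that approach (assuming device 0, error checking omitted): the bank count is not reported directly in `cudaDeviceProp`, so derive it from the major compute capability as documented in the programming guide.

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query properties of device 0

    // 16 banks for compute capability 1.x, 32 banks for 2.x and later
    int banks = (prop.major >= 2) ? 32 : 16;

    printf("Compute capability %d.%d -> %d shared memory banks\n",
           prop.major, prop.minor, banks);
    return 0;
}
```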
The general link to the CUDA online documentation is contained in the info link for the cuda tag.