Does Kepler have 2x or 4x the shared memory bandwidth of Fermi?
The Programming Guide states "Each bank has a bandwidth of 32 bits per two clock cycles" for compute capability 2.x and "Each bank has a bandwidth of 64 bits per clock cycle" for 3.x, so is 4x implied?
On Fermi, each SM has 32 banks, each delivering 32 bits every two clock cycles.
On Kepler, each SMX has 32 banks, each delivering 64 bits every clock cycle. However, Kepler's SMX was fundamentally redesigned for energy efficiency, and since fast clocks draw a lot of power, Kepler runs at a much slower core clock. Check out the "Inside Kepler" talk from GTC, about 8 minutes in, for more information.
So the answer to the question is that Kepler has ~2x, not 4x.
The next version of the documents (CUDA 5.0) should explain this better.
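A quick back-of-the-envelope check of that reasoning (a sketch: the per-bank figures come from the Programming Guide quotes above, but the 1.3 GHz and 0.7 GHz clock values are example figures I'm assuming for a Fermi-class and a Kepler-class part, not numbers from the thread):

```c
#include <stdio.h>

int main(void)
{
    /* Per-bank figures from the Programming Guide:
       Fermi (CC 2.x): 32 bits per two clock cycles; Kepler (CC 3.x): 64 bits per clock.
       Both architectures expose 32 banks per SM/SMX. */
    const int    banks            = 32;
    const double fermi_B_per_clk  = banks * (32.0 / 2.0) / 8.0; /*  64 bytes per clock */
    const double kepler_B_per_clk = banks *  64.0        / 8.0; /* 256 bytes per clock */

    printf("per-clock ratio: %.0fx\n", kepler_B_per_clk / fermi_B_per_clk); /* 4x */

    /* Example clocks (assumed): ~1.3 GHz for a Fermi shader clock vs
       ~0.7 GHz for a Kepler core clock. */
    const double fermi_clk_GHz  = 1.3;
    const double kepler_clk_GHz = 0.7;

    printf("effective ratio: ~%.1fx\n",
           (kepler_B_per_clk * kepler_clk_GHz) /
           (fermi_B_per_clk  * fermi_clk_GHz)); /* ~2.2x */
    return 0;
}
```

In other words, per clock the raw bank bandwidth is 4x, but once the slower Kepler clock is factored in, the effective figure lands around 2x.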
As given in:

Programming Guide 4.2: "Shared memory has 16 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles."

Kepler Whitepaper: "The shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock."
So for small load operations, it is 4x.
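One way to reconcile the two quotes (a sketch; it assumes the 2.x per-bank figure is specified against Fermi's shader clock, which runs at twice the core clock — the documents themselves don't spell this out):

```c
#include <stdio.h>

int main(void)
{
    const int banks = 32;

    /* Kepler: 32 banks x 64 bits per clock = 256 bytes per core clock,
       matching the whitepaper figure. */
    printf("Kepler: %d B/core clock\n", banks * 64 / 8);

    /* Fermi: 32 banks x 32 bits per two (shader) clocks = 64 B/shader clock.
       Assuming the shader clock is 2x the core clock, that is 128 B/core clock,
       so the whitepaper's "doubled to 256B" and the Programming Guide's 4x
       per-bank, per-clock figure can both hold at once. */
    printf("Fermi:  %d B/core clock (assumed)\n", banks * 32 / 2 / 8 * 2);
    return 0;
}
```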