Shared memory bandwidth Fermi vs Kepler GPU

Does Kepler have 2x or 4x the shared memory bandwidth of Fermi?

The Programming Guide states: "Each bank has a bandwidth of 32 bits per two clock cycles" (for 2.x), and "Each bank has a bandwidth of 64 bits per clock cycle" (for 3.x). Does that imply 4x?

P Marecki asked Sep 10 '12 15:09



2 Answers

On Fermi, each SM has 32 banks delivering 32 bits on every two clock cycles.

On Kepler, each SMX has 32 banks delivering 64 bits on every clock cycle. However, the two figures are not quoted against the same clock. Kepler's SMX was fundamentally redesigned to be energy efficient, and since running fast clocks draws a lot of power, Kepler runs from a much slower core clock than the shader clock Fermi uses. Check out the Inside Kepler talk from GTC, about 8 minutes in, for more information.

So the answer to the question is that Kepler has ~2x the shared memory bandwidth of Fermi, not 4x.
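
To make the arithmetic explicit, here is a rough sketch in Python. It assumes that Fermi's "two clock cycles" are counted in shader clocks running at twice the core clock, and that Kepler has no separate shader clock. The function name and its parameters are made up for this illustration, and the result is a per-SM peak figure, not a measurement.

```python
def bytes_per_core_clock(banks, bits_per_bank, clocks_per_access, clock_multiplier):
    """Peak shared memory bytes delivered per SM per core clock.

    clock_multiplier is how many of the quoted clocks fit in one core clock
    (assumed 2 for Fermi's shader clock, 1 for Kepler).
    """
    return banks * bits_per_bank / 8 / clocks_per_access * clock_multiplier


# Fermi (2.x): 32 banks, 32 bits per two (assumed shader) clocks.
fermi = bytes_per_core_clock(32, 32, 2, 2)
# Kepler (3.x): 32 banks, 64 bits per clock.
kepler = bytes_per_core_clock(32, 64, 1, 1)

print(f"Fermi SM:   {fermi:.0f} B per core clock")
print(f"Kepler SMX: {kepler:.0f} B per core clock")
print(f"Ratio:      {kepler / fermi:.1f}x")
```

Under those assumptions Kepler comes out at 256 B per core clock against 128 B for Fermi. Reading both per-bank figures as if they used the same clock is what produces the apparent 4x.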

The next version of the documents (CUDA 5.0) should explain this better.

Tom answered Nov 09 '22 23:11



As given in the documentation:

Programming Guide 4.2: Shared memory has 32 banks on compute capability 2.x (16 on 1.x), organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles.

Kepler Whitepaper: The shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

For small load operations, 4X it is.

Fr34K answered Nov 09 '22 22:11
