Does Kepler have 2x or 4x the shared memory bandwidth of Fermi?
The Programming Guide states "Each bank has a bandwidth of 32 bits per two clock cycles" for compute capability 2.x and "Each bank has a bandwidth of 64 bits per clock cycle" for 3.x, so is 4x implied?
On Fermi, each SM has 32 banks, each delivering 32 bits every two clock cycles.
On Kepler, each SMX has 32 banks, each delivering 64 bits every clock cycle. However, Kepler's SMX was fundamentally redesigned for energy efficiency, and since fast clocks draw a lot of power, Kepler runs at a much slower core clock. Check out the "Inside Kepler" talk from GTC, about 8 minutes in, for more information.
So the answer to the question is that Kepler has ~2x, not 4x.
The next version of the documents (CUDA 5.0) should explain this better.
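A quick back-of-the-envelope check of that reasoning (a sketch: the per-bank figures come from the Programming Guide quotes above, but the 1.3 GHz and 0.7 GHz clock values are example figures I'm assuming for a Fermi-class and a Kepler-class part, not numbers from the thread):

```c
#include <stdio.h>

int main(void)
{
    /* Per-bank figures from the Programming Guide:
       Fermi (CC 2.x): 32 bits per two clock cycles; Kepler (CC 3.x): 64 bits per clock.
       Both architectures expose 32 banks per SM/SMX. */
    const int    banks            = 32;
    const double fermi_B_per_clk  = banks * (32.0 / 2.0) / 8.0; /*  64 bytes per clock */
    const double kepler_B_per_clk = banks *  64.0        / 8.0; /* 256 bytes per clock */

    printf("per-clock ratio: %.0fx\n", kepler_B_per_clk / fermi_B_per_clk); /* 4x */

    /* Example clocks (assumed): ~1.3 GHz for a Fermi shader clock vs
       ~0.7 GHz for a Kepler core clock. */
    const double fermi_clk_GHz  = 1.3;
    const double kepler_clk_GHz = 0.7;

    printf("effective ratio: ~%.1fx\n",
           (kepler_B_per_clk * kepler_clk_GHz) /
           (fermi_B_per_clk  * fermi_clk_GHz)); /* ~2.2x */
    return 0;
}
```

In other words, per clock the raw bank bandwidth is 4x, but once the slower Kepler clock is factored in, the effective figure lands around 2x.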
As given in:

Programming Guide 4.2: "Shared memory has 16 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles."

Kepler Whitepaper: "The shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock."
So for small load operations, it is 4x.
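One way to reconcile the two quotes (a sketch; it assumes the 2.x per-bank figure is specified against Fermi's shader clock, which runs at twice the core clock — the documents themselves don't spell this out):

```c
#include <stdio.h>

int main(void)
{
    const int banks = 32;

    /* Kepler: 32 banks x 64 bits per clock = 256 bytes per core clock,
       matching the whitepaper figure. */
    printf("Kepler: %d B/core clock\n", banks * 64 / 8);

    /* Fermi: 32 banks x 32 bits per two (shader) clocks = 64 B/shader clock.
       Assuming the shader clock is 2x the core clock, that is 128 B/core clock,
       so the whitepaper's "doubled to 256B" and the Programming Guide's 4x
       per-bank, per-clock figure can both hold at once. */
    printf("Fermi:  %d B/core clock (assumed)\n", banks * 32 / 2 / 8 * 2);
    return 0;
}
```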