I have the following simple code:
#include <stdio.h>

#define BLOCKSIZE_X 32
#define BLOCKSIZE_Y 1

// Rounds up a / b.
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Copies the M x N array d_a into d_b, one element per thread.
__global__ void kernel0(float *d_a, float *d_b, const unsigned int M, const unsigned int N)
{
    const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    const int tidy = threadIdx.y + blockIdx.y * blockDim.y;

    if ((tidx < M) && (tidy < N)) {
        d_b[tidy * M + tidx] = d_a[tidy * M + tidx];
    }
}

int main()
{
    const unsigned int M = 32;
    const unsigned int N = 1;

    float *d_a; gpuErrchk(cudaMalloc((void**)&d_a, M * N * sizeof(float)));
    float *d_b; gpuErrchk(cudaMalloc((void**)&d_b, M * N * sizeof(float)));

    dim3 dimGrid(iDivUp(M, BLOCKSIZE_X), iDivUp(N, BLOCKSIZE_Y));
    dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y);

    kernel0<<<dimGrid, dimBlock>>>(d_a, d_b, M, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    cudaDeviceReset();

    return 0;
}
which executes the assignment between two arrays of 32 floats. I'm trying to understand the relation between coalesced global memory accesses and the global memory load/store efficiency metrics, as well as other metrics/events.
The Visual Profiler shows the following metrics:
Global Memory Load Efficiency = 50%
Global Memory Store Efficiency = 100%
The value of the Global Memory Load Efficiency surprises me. I would have expected 100% efficiency in both cases, since I believe I'm performing perfectly coalesced memory accesses. So my question is:
Why is the Global Memory Load Efficiency only 50% when I'm performing a coalesced memory access, while the Global Memory Store Efficiency is 100%?
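For reference, here is the back-of-the-envelope arithmetic behind my 100% expectation (a sketch assuming the metric is simply requested bytes divided by actually transferred bytes, which is my understanding and not an official formula):

#include <stdio.h>

int main(void)
{
    // Sketch only: assumes load efficiency = requested bytes / transferred bytes.
    const int threadsPerWarp = 32;
    const int bytesPerFloat  = 4;
    const int requested      = threadsPerWarp * bytesPerFloat; // 128 B per warp
    const int lineBytes      = 128;   // Fermi L1 cache-line size
    const int transactions   = 1;     // one aligned 128 B line serves the whole warp
    printf("expected load efficiency = %.0f%%\n",
           100.0 * requested / (transactions * lineBytes));    // prints 100%
    return 0;
}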
I have also investigated other metrics/events, which may be useful to report:
gld_inst_32bit = 32 (Number of 32-bit global memory load transactions)
gst_inst_32bit = 32 (Number of 32-bit global memory store transactions)
Indeed, I'm requesting to load/store 32 floats.
uncached global load transaction = 0 (Number of uncached global load transactions)
l1 global load miss = 2 (Number of global load misses in L1 cache)
The above two events seem contradictory, according to my (probably erroneous) understanding: in the case of an L1 cache miss, I would have expected the first event to be different from 0.
gld_request = 1 (Number of executed global load instructions per warp in a SM)
gst_request = 1 (Number of executed global store instructions per warp in a SM)
which seem to be consistent with the fact that I'm performing a perfectly coalesced memory access.
The disassembled code is the following:
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R3, SR_CTAID.Y; /* 0x2c0000009800dc04 */
/*0010*/ S2R R4, SR_TID.Y; /* 0x2c00000088011c04 */
/*0018*/ IMAD R4, R3, c[0x0][0xc], R4; /* 0x2008400030311ca3 */
/*0020*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0028*/ ISETP.LT.U32.AND P0, PT, R4, c[0x0][0x2c], PT; /* 0x188e4000b041dc03 */
/*0030*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0038*/ IMAD R0, R0, c[0x0][0x8], R2; /* 0x2004400020001ca3 */
/*0040*/ ISETP.LT.U32.AND P0, PT, R0, c[0x0][0x28], P0; /* 0x18804000a001dc03 */
/*0048*/ @!P0 BRA.U 0x78; /* 0x40000000a000a1e7 */
/*0050*/ @P0 IMAD R2, R4, c[0x0][0x28], R0; /* 0x20004000a04080a3 */
/*0058*/ @P0 ISCADD R0, R2, c[0x0][0x20], 0x2; /* 0x4000400080200043 */
/*0060*/ @P0 ISCADD R2, R2, c[0x0][0x24], 0x2; /* 0x4000400090208043 */
/*0068*/ @P0 LD R0, [R0]; /* 0x8000000000000085 */
/*0070*/ @P0 ST [R2], R0; /* 0x9000000000200085 */
/*0078*/ EXIT; /* 0x8000000000001de7 */
EDIT
My configuration: CUDA 6.5, GeForce GT540M, Windows 7.
If I increase M from 32 to 64, to launch two blocks and make both of the available Streaming Multiprocessors of my card busy, then the Global Memory Load Efficiency turns to 100%,
and these are the new metrics/events:
gld_inst_32bit = 64
gst_inst_32bit = 64
uncached global load transaction = 0
l1 global load miss = 2
gld_request = 2
gst_request = 2
The increases of gld_inst_32bit, gst_inst_32bit, gld_request and gst_request are expected and consistent, since now I'm loading/storing 64 floats through 2 coalesced global memory load/store requests. But I still do not understand how uncached global load transaction and l1 global load miss can stay the same, while the global memory load throughput changes to provide 100% efficiency.
EDIT
Results on a Kepler K20c for M=32:
Global Memory Load Efficiency = 100%
Global Memory Store Efficiency = 100%
gld_inst_32bit = 64
gst_inst_32bit = 64
gld_request = 1
gst_request = 1
uncached global load transaction = 1
l1 global load miss = 0
l1 global load hit = 0
Now the Visual Profiler reports an uncached global load transaction, but no L1 global load miss.
EDIT
I have investigated this issue a little bit more, increasing the value of M while keeping BLOCKSIZE_X fixed.
When the number of blocks is odd, so that the load on the two Streaming Multiprocessors of my GT540M card is unbalanced, the Global Memory Load Efficiency is less than 100%; when it is even, the efficiency is 100%. In the odd case, the Global Memory Load Efficiency slowly tends to 100% as the number of blocks increases.
If I disable the L1 cache by compiling with -Xptxas -dlcm=cg, as suggested by @Jez, then the Global Memory Load Efficiency is always equal to 100%, as is the Global Memory Store Efficiency. I know that global memory stores do not use the L1 cache, but only L2.
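As a side note, the caching behavior can also be selected per load rather than for the whole compilation unit, via the PTX cache operator; a minimal sketch with inline PTX (I'm assuming 64-bit addressing for the "l" constraint; ld.global.cg caches in L2 only, like -dlcm=cg):

__global__ void kernel0_cg(float *d_a, float *d_b, const unsigned int M, const unsigned int N)
{
    const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    const int tidy = threadIdx.y + blockIdx.y * blockDim.y;

    if ((tidx < M) && (tidy < N)) {
        const float *p = &d_a[tidy * M + tidx];
        float val;
        // ld.global.cg: cache globally (L2 only), bypassing L1.
        asm("ld.global.cg.f32 %0, [%1];" : "=f"(val) : "l"(p));
        d_b[tidy * M + tidx] = val;
    }
}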
Some pictures showing the behavior of the Global Memory Load Efficiency for different values of M (32, 64, 96, 128, 160, 192). [Plots omitted.]
Note that M is an integer multiple of 32, so that an entire cache line can be loaded by a single warp.
By disabling the L1 cache, I have the corresponding plots for M = 32, 64 and 96. [Plots omitted.]
EDIT - RESULTS FOR A TESLA C2050
M       Global Memory Load Efficiency
32      33.3%
64      28.6%
96      42.9%
128     57.1%
160     71.4%
192     85.7%
224     100%
256     114%
288     90%
Again, if I disable the L1 cache, I have a 100% Global Memory Load Efficiency in all cases.
ACCURACY OF COUNTERS
The NVIDIA profilers can collect both raw counters and metrics. Many of the metrics require the kernel to be executed many times. Ideally, the profiler would be able to collect all raw counters for a metric in a single pass but this is simply not possible given the limitations of the performance monitor systems.
On Fermi architecture the HWPM system used to collect L1 statistics for Global Memory Load Efficiency and Global Memory Store Efficiency can only observe 1 L1 unit per GPC. For GF100 (C2050) this equates to 25% observation.
The profiler will not be able to provide accurate results unless the workload fully fills the machine and the work per observed unit is the same in every pass.
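As an illustration only (a toy model, not the actual HWPM algorithm): assume the profiler observes one of the GT540M's two SMs and doubles its L1 transaction counts to estimate the whole chip, with each block issuing one 128-byte load.

#include <stdio.h>

/* Toy model, NOT the profiler's actual algorithm: the observed SM is
 * assumed to receive the extra block when the count is odd. */
int main(void)
{
    const int numSMs = 2;  /* GT540M */
    for (int blocks = 1; blocks <= 6; blocks++) {
        int onObservedSM   = (blocks + numSMs - 1) / numSMs;  /* ceil */
        double transferred = onObservedSM * 128.0 * numSMs;   /* extrapolated */
        double requested   = blocks * 128.0;
        printf("blocks=%d  estimated load efficiency = %.1f%%\n",
               blocks, 100.0 * requested / transferred);
    }
    return 0;
}

This toy model reproduces the 50% reading for a single block (M=32) and the odd/even behavior reported above. If an extra block instead lands on an unobserved unit, the same extrapolation under-counts transactions and can report efficiencies above 100%, which may be what produces the 114% reading on the C2050.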
On Kepler architecture the HWPM system can collect L1 statistics from every L1 but still has some limitations for L2 which can lead to small discrepancies.
On Maxwell architecture the memory system is significantly different as global, local, and surface requests now all go through the unified L1/TEX cache.
CACHED VS. UNCACHED
In the Fermi architecture, all global loads/stores go through the L1 cache. Uncached global loads/stores still go through L1, use an LSU transaction, and require a tag lookup to invalidate the cache line. Atomics are the only form of global access through L1 that do not invalidate the L1 cache.
The Kepler architecture has a few small changes. By default most chips do not cache global memory accesses so all global loads are uncached. On GK110 and GK208 chips it is possible to use the new LDG instruction to load global data through the TEX cache.
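For reference, on GK110 and GK208 the LDG path is exposed in CUDA C/C++ through the __ldg() intrinsic (compute capability 3.5 and higher); a minimal sketch based on the question's kernel:

__global__ void kernel0_ldg(const float * __restrict__ d_a, float *d_b, const unsigned int M, const unsigned int N)
{
    const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    const int tidy = threadIdx.y + blockIdx.y * blockDim.y;

    if ((tidx < M) && (tidy < N)) {
        // __ldg() emits an LDG instruction: a read-only global load
        // routed through the TEX (read-only data) cache instead of L1.
        d_b[tidy * M + tidx] = __ldg(&d_a[tidy * M + tidx]);
    }
}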
Cached and uncached global load transactions from the SM to L1 are 128 bytes.
Cached global load transactions from L1 to L2 are done as 4 32B requests.
Uncached global load transactions from L1 to L2 are done as the minimal number of 32B requests.
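A small worked example of the difference, using a hypothetical access pattern: if only 8 of a warp's 32 threads are active and they read 32 contiguous, aligned bytes, a cached load still fetches the whole 128-byte line, whereas an uncached load fetches only the segments actually touched.

#include <stdio.h>

int main(void)
{
    // Hypothetical scenario: 8 active threads reading 8 consecutive,
    // aligned floats (32 bytes) from one 128-byte line.
    const int activeThreads = 8;
    const int bytesNeeded   = activeThreads * 4;          // 32 B

    const int cachedRequests   = 128 / 32;                // always 4 x 32 B
    const int uncachedRequests = (bytesNeeded + 31) / 32; // here: 1 x 32 B

    printf("cached: %d x 32B requests, uncached: %d x 32B requests\n",
           cachedRequests, uncachedRequests);
    return 0;
}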