My environment is
I am having trouble figuring out why my GPU is crashing with the "uncorrectable ECC error encountered". This error only occurs when i use 512 threads or more. I can't post the kernel, but i will try to describe what it does.
In general, the kernel takes a number of parameters and produces 2 complex matricies defined by the thread size, M and another number, N. So the returned matrices will be of size MxN. A typical configuration is 512x512, but each number is independent and can vary up or down. The kernel works when the numbers are 256x256.
Each thread (kernel) extracts a 999 size vector out of a 2D array based on the thread id, ie size 999xM, then cycles through the row (0 .. N-1) of the output matrices for calculation. A number of intermediate parameters are calculated, only using pow, sin and cos among the + - * / operators. To calculate one of the output matrices an additional loop needs to be executed to sum up the contribution of the 999 vector that was extracted earlier. This loop does some intermediate calculations to determine a range of values that will allow contribution. The contribution is then scaled by a factor determined by the cos and sine values of a calculated fractional value. This is where it crashes. If i stick in a constant value or 1.0 or any other for that matter, the kernel executes without trouble. however, when only one of the calls (cos or sine) is included, the kernel crashes.
Some psuedocode follows:
kernel()
{
/* Extract 999 vector from 2D array 999xM - one 999 vector for each thread. */
for (int i = 0; i < 999; i++)
{
    .....
}
/* Cycle through the 2nd dimension of the output matricies */
for (int j = 0; j < N; j++)
{
    /* Calculate some intermediate variables */
    /* Calculate the real and imaginary components of the first output matrix */
    /* real = cos(value), imaginary = sin(value) */
    /* Construct the first output matrix from some intermediate variables and the real and imaginary components */
    /* Calculate some more intermediate variables */
    /* cycle through the extracted vector (0 .. 998) */
    for (int k = 0; k < 999; k++)
    {
        /* Calculate some more intermediate variables */
        /* Determine the range of allowed values to contribute to the second output matrix. */
        /* Calculate the real and imaginary components of the second output matrix */
        /* real = cos(value), imaginary = sin(value) */
        /* This is were it crashes, unless real and imaginary are constant values (1.0) */
        /* Sum up the contributions of the extracted vector to the second output matrix */
     }
     /* Construct the Second output matrix from some intermediate variables and the real and imaginary components */
}
}
I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case, I'm using less than the 32,768 registers with 512 threads. Can anyone give any suggestions as to what the cause of this could be?
Here is the ptasx info:
ptxas info    : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20' 
ptxas info    : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_ 
8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info    : Function properties for __internal_trig_reduction_slowpathd 
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info    : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp 
"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU device memory.
Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and associated rise in temperature.
There are diagnostic utilities floating around to stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an analog for testing the device RAM banks of the GPU.
If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, this confirms that the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With