break overhead vs control flag

I was using a naive prime-generating kernel. This code takes about 5.25 seconds to generate 10k prime numbers (device_primes[0] holds the number of primes already found; the remaining positions hold the primes themselves).

__global__ void getPrimes(int *device_primes, int n)
{ 
    int c = 0;
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int num = thread_id+2;

    if (thread_id == 0) device_primes[0] = 1;
    __syncthreads();

    while(device_primes[0] < n)
    {
        for (c = 2; c <= num - 1; c++)
        { 
            if (num % c == 0) //not prime
            {
                break;
            }
        }
        if (c == num) //prime
        {
            int pos = atomicAdd(&device_primes[0],1);
            device_primes[pos] = num;
        }
        num += blockDim.x * gridDim.x; // Next number for this thread       
    }
}

I was just starting to optimize the code, and I made the following modification. Instead of:

for (c = 2; c <= num - 1; c++)
{ 
    if (num % c == 0) //not prime
         break;
}
if (c == num) {...}

I now have:

int prime = 1;

...
for (c = 2; c <= num - 1 && prime; c++)
{
    if (num % c == 0) prime = 0; // not prime
}
if (prime) {...} // if prime

Now I can generate 10k primes in 0.707 s. I was just wondering why there is such a speedup from this simple modification. Is break that bad?
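For reference, a minimal host-side harness along these lines can drive the kernel above; the 64-block / 256-thread launch configuration and the event-based timing here are illustrative assumptions, not necessarily the exact setup used for the timings quoted:

#include <cstdio>
#include <cuda_runtime.h>

// getPrimes kernel as defined above

int main()
{
    const int n = 10000;            // number of primes to find
    const int slots = n + 256;      // slot 0 = counter; small slack because several
                                    // threads can pass the while check before the
                                    // counter reaches n
    int *device_primes = NULL;
    cudaMalloc(&device_primes, slots * sizeof(int));
    cudaMemset(device_primes, 0, slots * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    getPrimes<<<64, 256>>>(device_primes, n);   // illustrative launch configuration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(device_primes);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}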

asked Nov 05 '12 by dreamcrash
1 Answer

As Tony suggested, divergent code execution can cause major slowdowns in GPU code, forcing some code to run serially rather than in parallel. In the slow version of the code above, threads that hit the break diverge from threads that continue looping.

The CUDA C Programming Guide is a good resource for GPU programming techniques. Here is what it says about control flow:

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow different execution paths). If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

Newer NVIDIA hardware and CUDA versions can handle some branching a little better than older versions, but it's still best to avoid branching whenever possible.
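A minimal sketch of the quoted point (the kernel names here are just illustrative): in the first kernel the branch condition differs between threads of the same warp, so both paths are executed serially for that warp; in the second the condition is uniform per warp, so there is no divergence.

__global__ void divergentBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                  // even/odd threads share a warp -> divergence
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
}

__global__ void uniformBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)     // same condition for every thread of a warp
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
}

This mirrors the break in the loop above: threads that leave the loop early diverge from their warp-mates that keep iterating.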

answered Nov 04 '22 by Eric Olson