break overhead vs control flag

I was using a naive prime-generating kernel. This code takes about 5.25 seconds to generate 10k prime numbers (device_primes[0] holds the number of primes already found; the remaining positions hold the primes themselves).

__global__ void getPrimes(int *device_primes, int n)
{ 
    int c = 0;
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int num = thread_id+2;

    if (thread_id == 0) device_primes[0] = 1;
    __syncthreads();

    while(device_primes[0] < n)
    {
        for (c = 2; c <= num - 1; c++)
        { 
            if (num % c == 0) //not prime
            {
                break;
            }
        }
        if (c == num) //prime
        {
            int pos = atomicAdd(&device_primes[0],1);
            device_primes[pos] = num;
        }
        num += blockDim.x * gridDim.x; // Next number for this thread       
    }
}

I was just starting to optimize the code, and I made the following modification. Instead of:

for (c = 2; c <= num - 1; c++)
{ 
    if (num % c == 0) //not prime
         break;
}
if (c == num) {...}

I now have:

int prime = 1;

...
for (c = 2; c <= num - 1 && prime; c++)
{
    if (num % c == 0) prime = 0; // not prime
}
if (prime) {...} // if prime

Now I can generate 10k primes in 0.707 s. I was just wondering why there is such a speedup from this simple modification. Is break that bad?
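For reference, a minimal host-side harness along these lines can drive the kernel above; the 64-block / 256-thread launch configuration and the event-based timing here are illustrative assumptions, not necessarily the exact setup used for the timings quoted:

#include <cstdio>
#include <cuda_runtime.h>

// getPrimes kernel as defined above

int main()
{
    const int n = 10000;            // number of primes to find
    const int slots = n + 256;      // slot 0 = counter; small slack because several
                                    // threads can pass the while check before the
                                    // counter reaches n
    int *device_primes = NULL;
    cudaMalloc(&device_primes, slots * sizeof(int));
    cudaMemset(device_primes, 0, slots * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    getPrimes<<<64, 256>>>(device_primes, n);   // illustrative launch configuration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(device_primes);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}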

asked Nov 05 '12 by dreamcrash
1 Answer

As Tony suggested, divergent code execution can cause major slowdowns in GPU code, forcing some code to run serially rather than in parallel. In the slow version of the code above, threads that hit the break diverge from threads that continue looping.

The CUDA C Programming Guide is a good resource for GPU programming techniques. Here is what it says about control flow:

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow different execution paths). If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

Newer NVIDIA hardware and CUDA versions can handle some branching a little better than older versions, but it's still best to avoid branching whenever possible.
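A minimal sketch of the quoted point (the kernel names here are just illustrative): in the first kernel the branch condition differs between threads of the same warp, so both paths are executed serially for that warp; in the second the condition is uniform per warp, so there is no divergence.

__global__ void divergentBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                  // even/odd threads share a warp -> divergence
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
}

__global__ void uniformBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)     // same condition for every thread of a warp
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
}

This mirrors the break in the loop above: threads that leave the loop early diverge from their warp-mates that keep iterating.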

answered Nov 04 '22 by Eric Olson