What is the meaning of the statement
// create arrays of 1M elements
const int num_elements = 1<<20;
in the code below? Is it specific to CUDA, or can this be used in standard C as well?
When I printf'ed num_elements I got num_elements == 1048576, which turns out to be 2^20. So is the << operator a shorthand for exponentiation in C?
// This example demonstrates parallel floating point vector
// addition with a simple __global__ function.
#include <stdlib.h>
#include <stdio.h>
// this kernel computes the vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(const float *a,
                           const float *b,
                           float *c,
                           const size_t n)
{
    // compute the global element index this thread should process
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    // avoid accessing out of bounds elements
    if(i < n)
    {
        // sum elements
        c[i] = a[i] + b[i];
    }
}
int main(void)
{
    // create arrays of 1M elements
    const int num_elements = 1<<20;

    // compute the size of the arrays in bytes
    const int num_bytes = num_elements * sizeof(float);

    // pointers to host & device arrays
    float *device_array_a = 0;
    float *device_array_b = 0;
    float *device_array_c = 0;
    float *host_array_a = 0;
    float *host_array_b = 0;
    float *host_array_c = 0;

    // malloc the host arrays
    host_array_a = (float*)malloc(num_bytes);
    host_array_b = (float*)malloc(num_bytes);
    host_array_c = (float*)malloc(num_bytes);

    // cudaMalloc the device arrays
    cudaMalloc((void**)&device_array_a, num_bytes);
    cudaMalloc((void**)&device_array_b, num_bytes);
    cudaMalloc((void**)&device_array_c, num_bytes);

    // if any memory allocation failed, report an error message
    if(host_array_a == 0 || host_array_b == 0 || host_array_c == 0 ||
       device_array_a == 0 || device_array_b == 0 || device_array_c == 0)
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    // initialize host_array_a & host_array_b
    for(int i = 0; i < num_elements; ++i)
    {
        // make array a a linear ramp
        host_array_a[i] = (float)i;

        // make array b random
        host_array_b[i] = (float)rand() / RAND_MAX;
    }

    // copy arrays a & b to the device memory space
    cudaMemcpy(device_array_a, host_array_a, num_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(device_array_b, host_array_b, num_bytes, cudaMemcpyHostToDevice);

    // compute c = a + b on the device
    const size_t block_size = 256;
    size_t grid_size = num_elements / block_size;

    // deal with a possible partial final block
    if(num_elements % block_size) ++grid_size;

    // launch the kernel
    vector_add<<<grid_size, block_size>>>(device_array_a, device_array_b, device_array_c, num_elements);

    // copy the result back to the host memory space
    cudaMemcpy(host_array_c, device_array_c, num_bytes, cudaMemcpyDeviceToHost);

    // print out the first 10 results
    for(int i = 0; i < 10; ++i)
    {
        printf("result %d: %1.1f + %7.1f = %7.1f\n",
               i, host_array_a[i], host_array_b[i], host_array_c[i]);
    }

    // deallocate memory
    free(host_array_a);
    free(host_array_b);
    free(host_array_c);
    cudaFree(device_array_a);
    cudaFree(device_array_b);
    cudaFree(device_array_c);
}
No, the << operator is the left bit-shift operator. It is standard C (and C++), not specific to CUDA. It takes the bits of a number, such as 00101, and shifts them n places to the left, which has the effect of multiplying the number by a power of two. So x << y is x * 2^y. This is a result of the way numbers are stored internally in computers, which is binary.
For example, the number 1 is, when stored as a 32-bit integer in 2's complement (which it is):
00000000000000000000000000000001
When you do
1 << 20
You are taking all the 1's in that binary representation and moving them over 20 places:
00000000000100000000000000000000
Which is 2^20. This also works for sign-magnitude representation, 1's complement, etc., since all of these represent non-negative numbers the same way.
Another example, if you take the representation of 5:
00000000000000000000000000000101
And do 5 << 1, you get
00000000000000000000000000001010
Which is 10, or 5 * 2^1.
Conversely, the >> operator moves the bits n places to the right, which divides a non-negative number by a power of 2 (discarding any remainder). For negative signed values the result of >> is implementation-defined in C.