
CUDA periodic execution time

Tags:

c++

cuda

I just started learning CUDA and I am having trouble interpreting my experiment results. I wanted to compare CPU vs GPU in a simple program that adds two vectors together. The code is as follows:

#include <cstdio>           // printf
#include <cstdlib>          // atoi, atoll
#include <ctime>            // clock, CLOCKS_PER_SEC
#include <cuda_runtime.h>
#include <helper_timer.h>   // StopWatchInterface / sdk*Timer helpers from the CUDA samples

__global__ void add(int *a, int *b, int *c, long long n) {
    long long tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

void add_cpu(int* a, int* b, int* c, long long n) {
    for (long long i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

void check_results(int* gpu, int* cpu, long long n) {
    for (long long i = 0; i < n; i++) {
        if (gpu[i] != cpu[i]) {
            printf("Different results!\n");
            return;
        }
    }
}

int main(int argc, char* argv[]) {
    long long n = atoll(argv[1]);
    int num_of_blocks = atoi(argv[2]);
    int num_of_threads = atoi(argv[3]);

    int* a = new int[n];
    int* b = new int[n]; 
    int* c = new int[n]; 
    int* c_cpu = new int[n];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, n * sizeof(int));
    cudaMalloc((void **) &dev_b, n * sizeof(int));
    cudaMalloc((void **) &dev_c, n * sizeof(int));
    for (long long i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);

    StopWatchInterface *timer=NULL;
    sdkCreateTimer(&timer);
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);  

    add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);

    cudaDeviceSynchronize();
    sdkStopTimer(&timer);
    float time = sdkGetTimerValue(&timer);
    sdkDeleteTimer(&timer);

    cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    clock_t start = clock();
    add_cpu(a, b, c_cpu, n);
    clock_t end = clock();

    check_results(c, c_cpu, n);
    printf("%f %f\n", (double)(end - start) * 1000 / CLOCKS_PER_SEC, time);

    return 0;
}

I ran this code in a loop with a bash script:

for i in {1..2560}
do
    n="$((1024 * i))"
    out=`./vectors $n $i 1024`
    echo "$i $out" >> "./vectors.txt"
done

Where 2560 is the maximum number of blocks my GPU supports and 1024 is the maximum number of threads per block. So I ran it at the maximum block size, from one block up to the maximum problem size my GPU can handle, in steps of one block (1024 ints in the vector).
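
For reference, the block count could also be derived from n inside the program with ceiling division instead of being passed as argv[2] — a minimal sketch, assuming the same 1024-thread blocks and the dev_* buffers from the code above:

// Hypothetical alternative to taking the block count from argv[2]:
// derive it from n by rounding up to whole 1024-thread blocks.
int num_of_threads = 1024;
int num_of_blocks  = (int)((n + num_of_threads - 1) / num_of_threads);
add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);

With n always a multiple of 1024 in this sweep, both approaches launch the same grid.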

Here is my GPU info:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2070 SUPER"
  CUDA Driver Version / Runtime Version          11.3 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (040) Multiprocessors, (064) CUDA Cores/MP:    2560 CUDA Cores
  GPU Max Clock rate:                            1785 MHz (1.78 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

After running the experiment I gathered the results and plotted them (plot: relation between execution time and vector size).

What bothers me is this 256-block-wide period in the GPU execution time. I have no clue why this happens. Why is executing 512 blocks much slower than executing 513 blocks of threads?

I also checked this with a constant number of blocks (2560) as well as with different block sizes, and it always gives this period of 256 * 1024 in vector size (so for block size 512 it's every 512 blocks, not every 256 blocks). So maybe this is something to do with memory, but I can't figure out what.

I would appreciate any ideas on why this is happening.

zuroslav asked May 16 '21

1 Answer

This is by no means a complete or precise answer. However, I believe the periodic pattern you are observing is at least partly due to a one-time or first-time kernel launch overhead. Good benchmarking practice is usually to do something other than what you are doing: for example, run the kernel multiple times and take an average, or use some other kind of statistical measurement.
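
For illustration (my sketch, not code from the question), averaging the kernel time over several runs with CUDA events could look like this, assuming the kernel, launch configuration, and device buffers from the question's code:

// Sketch only: average the kernel time over several runs using CUDA events.
// Assumes dev_a, dev_b, dev_c, n, num_of_blocks and num_of_threads are set up
// exactly as in the question's code.
const int runs = 10;
cudaEvent_t start_evt, stop_evt;
cudaEventCreate(&start_evt);
cudaEventCreate(&stop_evt);

add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);  // warm-up launch
cudaDeviceSynchronize();

float total_ms = 0.0f;
for (int r = 0; r < runs; r++) {
    cudaEventRecord(start_evt);
    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
    cudaEventRecord(stop_evt);
    cudaEventSynchronize(stop_evt);                  // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start_evt, stop_evt);  // elapsed time in milliseconds
    total_ms += ms;
}
printf("average kernel time: %f ms\n", total_ms / runs);

cudaEventDestroy(start_evt);
cudaEventDestroy(stop_evt);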

When I run your code using your script on a GTX 960 GPU, I get the following graph (only plotting the GPU data, vertical axis is in milliseconds):

without warm-up

When I modify your code as follows:

cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);
// next two lines added:
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();

StopWatchInterface *timer=NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);  

add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);

cudaDeviceSynchronize();

Doing a "warm-up" run first and then timing the second run, I see data like this:

with warm-up

So the data without the warm-up shows a periodicity. After the warm-up, the periodicity disappears. I conclude that the periodicity is due to some kind of one-time or first-time behavior. Some typical things that might be in this category are caching effects and CUDA "lazy" initialization effects (for example, the time taken to JIT-compile the GPU code, which is certainly happening in your case, or the time to load the GPU code into GPU memory). I won't be able to go further with any explanation of exactly what kind of first-time effect is giving rise to the periodicity.
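
As a side note, and as my own suggestion rather than something established above: the JIT-compilation contribution can be removed by building for the GPU's actual architecture, and the general lazy-initialization cost can be paid before the timed region. A sketch, with paths and names to be adjusted:

// Build for compute capability 7.5 (RTX 2070 SUPER) so the first kernel launch
// does not have to JIT-compile PTX; example command line, adjust paths as needed:
//   nvcc -arch=sm_75 vectors.cu -o vectors

// In the program, force CUDA context creation and other lazy initialization
// before starting any timers:
cudaFree(0);

// Then do a warm-up launch of the kernel under test and synchronize, as shown
// above, before the timed launch:
add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();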

Another observation is that while my data shows the expected "average slope" in each graph, indicating that the kernel duration associated with 2560 blocks is approximately 5 times the kernel duration associated with 512 blocks, I don't see that kind of trend in your data. It ought to be there, however. Your GPU will "saturate" at about 40 blocks. Thereafter, the average kernel duration should increase in an approximately linear fashion, such that the kernel duration associated with 2560 blocks is 4-5x the kernel duration associated with 512 blocks. I can't explain your data in this respect at all; I suspect a graphing or data-processing error, or else a characteristic of your environment (e.g. a GPU shared with other users, a broken CUDA install, etc.) that is not present in my environment and which I'm unable to guess at.

Finally, my conclusion is that GPU "expected" behavior is more evident in the presence of good benchmarking techniques.

Robert Crovella answered Oct 05 '22