
CUDA Thrust slow when operating large vectors on my machine

Tags: c++, c, cuda, thrust

I'm a CUDA beginner reading through some Thrust tutorials. I wrote a simple (but terribly organized) piece of code to try to measure the speedup from Thrust (is this idea even correct?). I add two vectors (each with 10000000 ints) into a third vector, once by adding plain arrays on the CPU and once by adding device_vectors on the GPU.

Here is the thing:

#include <iostream>
#include <ctime>      // clock(), CLOCKS_PER_SEC
#include <cstdio>     // getchar()
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N 10000000
int main(void)
{
    float time_cpu;
    float time_gpu;
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];
    for(int i=0;i<N;i++)
    {
        a[i]=i;
        b[i]=i*i;
    }
    // CPU version: plain array addition, timed with clock()
    clock_t start_cpu,stop_cpu;
    start_cpu=clock();
    for(int i=0;i<N;i++)
    {
        c[i]=a[i]+b[i];
    }
    stop_cpu=clock();
    time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
    std::cout<<"Time to generate (CPU):"<<time_cpu<<std::endl;
    // GPU version: thrust::device_vector addition, timed with CUDA events
    thrust::device_vector<int> X(N);
    thrust::device_vector<int> Y(N);
    thrust::device_vector<int> Z(N);
    for(int i=0;i<N;i++)
    {
        X[i]=i;
        Y[i]=i*i;
    }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);
    thrust::transform(X.begin(), X.end(),
        Y.begin(),
        Z.begin(),
        thrust::plus<int>());
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime,start,stop);
    std::cout<<"Time to generate (thrust):"<<elapsedTime<<std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    delete[] a; delete[] b; delete[] c;
    getchar();
    return 0;
}

The CPU result appears really quickly, but the GPU runs REALLY slowly on my machine (i5-2320, 4 GB, GTX 560 Ti): the CPU time is about 26 ms while the GPU time is around 30 ms! Did I just use Thrust wrong, with some stupid error in my code, or is there a deeper reason?

As a C++ rookie, I checked my code over and over and still got a slower time on the GPU with Thrust, so I ran some experiments to compare five different approaches to the vector add. I use the Windows API QueryPerformanceCounter()/QueryPerformanceFrequency() as a unified time-measurement method.

Each of the experiments looks like this:

LARGE_INTEGER large_interger;
double c1, c2, f;
QueryPerformanceFrequency(&large_interger);   // counter frequency (ticks per second)
f = large_interger.QuadPart;
QueryPerformanceCounter(&large_interger);
c1 = large_interger.QuadPart;

for(int j=0;j<10;j++)
{
    for(int i=0;i<N;i++)   // CPU array adding
    {
        c[i]=a[i]+b[i];
    }
}
QueryPerformanceCounter(&large_interger);
c2 = large_interger.QuadPart;
printf("Time to generate (CPU array adding) %lf ms\n", (c2 - c1) * 1000 / f);

and here is my simple __global__ function for GPU array adding:

__global__ void add(int *a, int *b, int *c)
{
    // grid-stride loop: each thread handles elements tid, tid+stride, tid+2*stride, ...
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    while(tid<N)
    {
        c[tid]=a[tid]+b[tid];
        tid+=blockDim.x*gridDim.x;
    }
}

and the function is called as:

for(int j=0;j<10;j++)
{
    add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c);//GPU array adding
}   
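For completeness, here is a minimal sketch of the host-side setup this launch assumes (allocating and filling dev_a/dev_b/dev_c, then syncing before the timer is read); the exact setup code isn't shown above, so treat this as an assumption rather than my actual code:

// Sketch: allocate device buffers, copy the inputs, launch, then sync.
int *dev_a, *dev_b, *dev_c;
cudaMalloc(&dev_a, N * sizeof(int));
cudaMalloc(&dev_b, N * sizeof(int));
cudaMalloc(&dev_c, N * sizeof(int));
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

for(int j=0;j<10;j++)
{
    add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c);
}
cudaDeviceSynchronize();   // kernel launches are asynchronous, so sync before stopping the timer

cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);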

I add vectors a[N] and b[N] into vector c[N], looping 10 times, by:

  1. adding arrays on the CPU
  2. adding std::vector on the CPU
  3. adding thrust::host_vector on the CPU
  4. adding thrust::device_vector on the GPU
  5. adding arrays on the GPU (the raw kernel above)

With N=10000000, I get these results:

  1. CPU array adding 268.992968ms
  2. CPU std::vector adding 1908.013595ms
  3. CPU Thrust::host_vector adding 10776.456803ms
  4. GPU Thrust::device_vector adding 297.156610ms
  5. GPU array adding 5.210573ms

And this confused me. I'm not familiar with the implementation of these template libraries: does performance really differ this much between containers and raw data structures?

asked Sep 27 '12 by Tony


2 Answers

Most of the execution time is being spent in your loop that initializes X[i] and Y[i]. While this is legal, it's a very slow way to initialize large device vectors, because each element assignment from host code becomes a separate tiny host-to-device transfer. It would be better to create host vectors, initialize them, and then copy those to the device. As a test, modify your code like this (right after the loop where you initialize the device vectors X[i] and Y[i]):

}  // this is your line of code
std::cout<< "Starting GPU run" <<std::endl;  //add this line
cudaEvent_t start, stop;   //this is your line of code

You will then see that the GPU timing results appear almost immediately after that added line prints out. So all of the time you're waiting is spent in initializing those device vectors directly from host code.

When I run this on my laptop, I get a CPU time of about 40 ms and a GPU time of about 5 ms, so the GPU is running about 8 times faster than the CPU for the sections of code you are actually timing.

If you create X and Y as host vectors, and then create analogous d_X and d_Y device vectors, the overall execution time will be shorter, like so:

thrust::host_vector<int> X(N);     
thrust::host_vector<int> Y(N);     
thrust::device_vector<int> Z(N);     
for(int i=0;i<N;i++)     
{     
    X[i]=i;     
    Y[i]=i*i;     
}   
thrust::device_vector<int> d_X = X;
thrust::device_vector<int> d_Y = Y;

and change your transform call to:

thrust::transform(d_X.begin(), d_X.end(),      
    d_Y.begin(),      
    Z.begin(),      
    thrust::plus<int>()); 

OK, so you've now indicated that the CPU run measurement is faster than the GPU measurement. Sorry, I jumped to conclusions. My laptop is an HP laptop with a 2.6GHz Core i7 and a Quadro 1000M GPU, running CentOS 6.2 Linux. A few comments:

If you're running any heavy display tasks on your GPU, that can detract from performance.

When benchmarking these things, it's common practice to use the same mechanism for both sides of the comparison; you can use cudaEvents for both if you want, since it can time CPU code the same as GPU code. It's also common practice with Thrust to do an untimed warm-up run, then repeat the test for the measurement, and likewise to run the test 10 times or more in a loop and divide to get an average.

In my case, I can tell the clock() measurement is pretty coarse, because successive runs give me 30, 40 or 50. On the GPU measurement I get something like 5.18256. Some of these things may help, but I can't say exactly why your results and mine differ so much (on the GPU side).
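A minimal sketch of that methodology, using only what is shown in the snippets above (the a/b/c host arrays, the d_X/d_Y/Z Thrust vectors, and standard cudaEvent calls); this is an illustration of the warm-up-plus-average approach, not code from the answer:

// Sketch: cudaEvents time both the CPU loop and the Thrust transform,
// with an untimed warm-up and a 10-run average on the GPU side.
cudaEvent_t t0, t1;
cudaEventCreate(&t0);
cudaEventCreate(&t1);
float ms = 0.0f;

// CPU side, timed with the same cudaEvent mechanism
cudaEventRecord(t0, 0);
for (int j = 0; j < 10; j++)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
cudaEventRecord(t1, 0);
cudaEventSynchronize(t1);
cudaEventElapsedTime(&ms, t0, t1);
std::cout << "CPU average: " << ms / 10 << " ms" << std::endl;

// GPU side: one untimed warm-up run, then 10 timed runs
thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
cudaEventRecord(t0, 0);
for (int j = 0; j < 10; j++)
    thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
cudaEventRecord(t1, 0);
cudaEventSynchronize(t1);
cudaEventElapsedTime(&ms, t0, t1);
std::cout << "GPU average: " << ms / 10 << " ms" << std::endl;

cudaEventDestroy(t0);
cudaEventDestroy(t1);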

OK, I did another experiment. The compiler makes a big difference on the CPU side. I compiled with the -O3 switch and the CPU time dropped to 0. Then I converted the CPU timing measurement from the clock() method to cudaEvents, and I got a CPU measured time of 12.4 (with -O3 optimization) and still 5.1 on the GPU side.

Your mileage will vary based on timing method and which compiler you are using on the CPU side.

answered Oct 06 '22 by Robert Crovella


First, Y[i]=i*i; does not fit in an int for 10M elements: a 32-bit signed int holds values only up to about 2e9, and your code needs values up to about 1e14.
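A minimal sketch of one way to keep an integer test without overflowing; switching to 64-bit elements is my assumption here, not something from the answer:

// Sketch (inside main, after #define N 10000000): 64-bit elements so i*i (up to ~1e14) fits.
thrust::host_vector<long long> hX(N), hY(N);
for (long long i = 0; i < N; i++) { hX[i] = i; hY[i] = i * i; }
thrust::device_vector<long long> X = hX, Y = hY, Z(N);
thrust::transform(X.begin(), X.end(), Y.begin(), Z.begin(), thrust::plus<long long>());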

Second, it looks like your timing of the transform itself is correct, and it should be faster than the CPU regardless of which library you're using. Robert's suggestion to initialize the vectors on the CPU and then transfer them to the GPU is a good one for this case.

Third, since we can't do the integer multiply, below is some simpler CUDA library code (using ArrayFire, which I work on) that does something similar with floats, for your benchmarking:

#include <arrayfire.h>
using namespace af;

int n = 10e6;                 // 10 million elements
array x = array(seq(n));      // x = 0, 1, 2, ..., n-1 (floats by default)
array y = x * x;
timer t = timer::tic();
array z = x + y;
af::eval(z); af::sync();      // force evaluation and wait for the GPU to finish
printf("elapsed seconds: %g\n", timer::toc(t));

Good luck!

answered Oct 06 '22 by arrayfire