I was testing the new CUDA 8 along with the Pascal Titan X GPU and is expecting speed up for my code but for some reason it ends up being slower. I am on Ubuntu 16.04. Here is the minimum code that can reproduce the result: CUDASample.cuh <pre class="prettyprint"><code>class CUDASample{ public: void AddOneToVector(std::vector<int> &in); }; </code></pre> CUDASample.cu <pre class="prettyprint"><code>__global__ static void CUDAKernelAddOneToVector(int *data) { const int x = blockIdx.x * blockDim.x + threadIdx.x; const int y = blockIdx.y * blockDim.y + threadIdx.y; const int mx = gridDim.x * blockDim.x; data[y * mx + x] = data[y * mx + x] + 1.0f; } void CUDASample::AddOneToVector(std::vector<int> &in){ int *data; cudaMallocManaged(reinterpret_cast<void **>(&data), in.size() * sizeof(int), cudaMemAttachGlobal); for (std::size_t i = 0; i < in.size(); i++){ data[i] = in.at(i); } dim3 blks(in.size()/(16*32),1); dim3 threads(32, 16); CUDAKernelAddOneToVector<<<blks, threads>>>(data); cudaDeviceSynchronize(); for (std::size_t i = 0; i < in.size(); i++){ in.at(i) = data[i]; } cudaFree(data); } </code></pre> Main.cpp <pre class="prettyprint"><code>std::vector<int> v; for (int i = 0; i < 8192000; i++){ v.push_back(i); } CUDASample cudasample; cudasample.AddOneToVector(v); </code></pre> The only difference is the NVCC flag, which for the Pascal Titan X is: <pre class="prettyprint"><code>-gencode arch=compute_61,code=sm_61-std=c++11; </code></pre> and for the old Maxwell Titan X is: <pre class="prettyprint"><code>-gencode arch=compute_52,code=sm_52-std=c++11; </code></pre> EDIT: Here are the results for running NVIDIA Visual Profiling. For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us. <img src="https://i.stack.imgur.com/RfKQN.png" alt="enter image description here"> For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong. <img src="https://i.stack.imgur.com/4S8Z1.png" alt="enter image description here"> I further isolate the problem by replacing cudaMallocManaged into good old cudaMalloc and did some profiling and observe some interesting result. CUDASample.cu <pre class="prettyprint"><code>__global__ static void CUDAKernelAddOneToVector(int *data) { const int x = blockIdx.x * blockDim.x + threadIdx.x; const int y = blockIdx.y * blockDim.y + threadIdx.y; const int mx = gridDim.x * blockDim.x; data[y * mx + x] = data[y * mx + x] + 1.0f; } void CUDASample::AddOneToVector(std::vector<int> &in){ int *data; cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int)); cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()), in.size() * sizeof(int), cudaMemcpyHostToDevice); dim3 blks(in.size()/(16*32),1); dim3 threads(32, 16); CUDAKernelAddOneToVector<<<blks, threads>>>(data); cudaDeviceSynchronize(); cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data), in.size() * sizeof(int), cudaMemcpyDeviceToHost); cudaFree(data); } </code></pre> For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us. <img src="https://i.stack.imgur.com/G2YD6.png" alt="enter image description here"> For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see... <img src="https://i.stack.imgur.com/7LX8v.png" alt="enter image description here"> Why is Pascal GPU so slow on running CUDA kernels when cudaMallocManaged is used? It will be a travesty if I have to revert all my existing code that uses cudaMallocManaged into cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using this results in a slow run time even the code is easier, this should be unacceptable because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong and why am I observing this kind of result?

Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see note at the end about CUDA 9 updated behavior for windows.) With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called <code>cudaMemcpy</code> to move the data yourself. With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to: <ol> <li>Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)</li> <li>Cause that page of memory to be migrated from the CPU to the GPU</li> </ol> This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be signficantly less efficient than moving all the data at once, either using <code>cudaMemcpy</code> or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it). Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs. This particular code example, however, does not benefit. This was anticipated, and so the recommended use to bring the behavior in line with previous (e.g. maxwell) behavior/performance is to precede the kernel launch with a <code>cudaMemPrefetchAsync()</code> call. You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory. With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case. As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched. Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in it's latency-hiding architecture, if there is sufficient other available "work" to attend to. As an additional note, in CUDA 9, the demand-paging regime for pascal and beyond is only available on linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as maxwell and prior devices; data is migrated to the GPU en-masse, at kernel launch. (*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question. Additional information is available in this blog article.

Why is NVIDIA Pascal GPUs slow on running CUDA Kernels when using cudaMallocManaged

Tags:

c++

cuda

gpgpu

nvidia

nvcc

I was testing the new CUDA 8 along with the Pascal Titan X GPU and is expecting speed up for my code but for some reason it ends up being slower. I am on Ubuntu 16.04.

Here is the minimum code that can reproduce the result:

CUDASample.cuh

class CUDASample{
 public:
  void AddOneToVector(std::vector<int> &in);
};

CUDASample.cu

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMallocManaged(reinterpret_cast<void **>(&data),
                    in.size() * sizeof(int),
                    cudaMemAttachGlobal);

  for (std::size_t i = 0; i < in.size(); i++){
    data[i] = in.at(i);
  }

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  for (std::size_t i = 0; i < in.size(); i++){
    in.at(i) = data[i];
  }

  cudaFree(data);
}

Main.cpp

std::vector<int> v;

for (int i = 0; i < 8192000; i++){
  v.push_back(i);
}

CUDASample cudasample;

cudasample.AddOneToVector(v);

The only difference is the NVCC flag, which for the Pascal Titan X is:

-gencode arch=compute_61,code=sm_61-std=c++11;

and for the old Maxwell Titan X is:

-gencode arch=compute_52,code=sm_52-std=c++11;

EDIT: Here are the results for running NVIDIA Visual Profiling.

For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us. enter image description here

For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong. enter image description here

I further isolate the problem by replacing cudaMallocManaged into good old cudaMalloc and did some profiling and observe some interesting result.

CUDASample.cu

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
  cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()), 
             in.size() * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data), 
             in.size() * sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(data);
}

For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us. enter image description here

For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see... enter image description here

Why is Pascal GPU so slow on running CUDA kernels when cudaMallocManaged is used? It will be a travesty if I have to revert all my existing code that uses cudaMallocManaged into cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using this results in a slow run time even the code is easier, this should be unacceptable because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong and why am I observing this kind of result?

224

asked Sep 30 '16 02:09

user3667089

1 Answers

Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see note at the end about CUDA 9 updated behavior for windows.)

With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.

With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:

Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
Cause that page of memory to be migrated from the CPU to the GPU

This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be signficantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).

Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.

This particular code example, however, does not benefit. This was anticipated, and so the recommended use to bring the behavior in line with previous (e.g. maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.

You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.

With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.

As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.

Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in it's latency-hiding architecture, if there is sufficient other available "work" to attend to.

As an additional note, in CUDA 9, the demand-paging regime for pascal and beyond is only available on linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as maxwell and prior devices; data is migrated to the GPU en-masse, at kernel launch.

(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.

Additional information is available in this blog article.

answered Nov 07 '22 15:11

Robert Crovella

Related questions
                            
                                Is it possible to make efficient pointer-based binary heap implementations?
                            
                                Why does this code behave differently if optimizing (-O2, -O3) is used?
                            
                                Why is this text not being colored by ncurses?
                            
                                array decay to pointer and overload resolution
                            
                                How can I pass a function using default parameter(s) to std:: thread?
                            
                                Does Clang/GCC really support a delay loading feature?
                            
                                Casablanca: Assembly error gcc 4.8.1 on linux centos
                            
                                QMetaProperty::read: Unable to handle unregistered datatype 'TreeItem<InspectorItem>*'
                            
                                Assign a literal to an std::u16string or to an std::u32string
                            
                                new operator with parentheses "(", ")" instead of brackets "[", "]"
                            
                                Get the Contents of a QComboBox
                            
                                Error C2280: attempting to reference a deleted function (unique_ptr)
                            
                                How to statically check if type T of a template is std::vector<U>, where U is either float, double or integral
                            
                                Facing "unable to find string literal operator" error when compiling ui code with C++11
                            
                                Why is this not a call of a pure virtual function?
                            
                                Distribute a program compiled with MinGW g++
                            
                                How to get started with Android Studio C++ development
                            
                                undefined reference to WinMain@16 C++, SDL-2
                            
                                how to use localtime_s with a pointer in c++
                            
                                Can't std::ostream output a const char array?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With