 

Clarifying memory transactions in CUDA

Tags:

cuda

gpu

I am confused about the following statements in section 5.3.2.1 (Performance Guidelines) of the CUDA Programming Guide 4.0.

Global memory resides in device memory and device memory is accessed
via 32-, 64-, or 128-byte memory transactions.

These memory transactions must be naturally aligned: only the 32-, 64-,
or 128-byte segments of device memory that are aligned to their size
(i.e. whose first address is a multiple of their size) can be read or
written by memory transactions.

1) My understanding of device memory was that accesses to device memory by threads are uncached: so if a thread accesses memory location a[i], it will fetch only a[i] and none of the values around a[i]. The first statement seems to contradict this. Or perhaps I am misunderstanding the usage of the phrase "memory transaction" here?

2) The second sentence does not seem very clear. Can someone explain this?

smilingbuddha asked Aug 10 '12


1 Answer

  1. Memory transactions are performed per warp. So a 32-byte transaction is a warp-sized read of an 8-bit type, a 64-byte transaction is a warp-sized read of a 16-bit type, and a 128-byte transaction is a warp-sized read of a 32-bit type.
  2. It just means that all transactions must be aligned to their own size. It is not possible for a warp to read a 128-byte transaction with a one-byte offset; such an access would have to be serviced by more than one transaction. See this answer for more details.
talonmies answered Sep 16 '22