cuda memory alignment

Question

In my code I am using structures in order to facilitate the passing of arguements to functions (I don't use arrays of structures, but instead structures of arrays in general). When I am in cuda-gdb and I examine the point in a kernel where I give values to a simple structure like

struct pt{
int i;
int j;
int k;
}

even though I am not doing something complicated and it's obvious that the members should have the values appointed, I get...

Asked for position 0 of stack, stack only has 0 elements on it.

So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to

struct __align__(16) pt{
int i;
int j;
int k;
}

but then, when the compiler tries to compile the host-code files that use the same definitions, gives the following error:

error: expected unqualified-id before numeric constant error: expected ‘)’ before numeric constant error: expected constructor, destructor, or type conversion before ‘;’ token

so, am I supposed to have two different definitions for host and device structures ???

Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.

For example, how should the following two be aligned? or, how should a structure with 6 floats be aligned? or 4 integers? again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or _ device _ functions.

struct {
    int a;
    int b;
    int c;
    int d;
    float* el;    
} ;

 struct {
    int a;
    int b
    int c
    int d
    float* i;
    float* j;
    float* k;
} ;

Thank you in advance for any advice or hints

harrism · Accepted Answer

There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.

First, the reason your host compiler gives you errors is because the host compiler doesn't know what __align(n)__ means, so it is giving a syntax error. What you need is to put something like the following in a header for your project.

#if defined(__CUDACC__) // NVCC
   #define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
  #define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
  #define MY_ALIGN(n) __declspec(align(n))
#else
  #error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif

So, am I supposed to have two different definitions for host and device structures?

No, just use MY_ALIGN(n), like this

struct MY_ALIGN(16) pt { int i, j, k; }

For example, how should the following two be aligned?

First, __align(n)__ (or any of the host compiler flavors), enforces that the memory for the struct begins at an address in memory that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding will be inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires each thread reads words aligned to 1,2,4, 8 or 16 bytes. So...

struct MY_ALIGN(16) {
  int a;
  int b;
  int c;
  int d;
  float* el;    
};

In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes. 16-byte alignment will waste 16 * (ceil(20/16) - 1) = 12 bytes per struct. On a 64-bit machine, it will waste only 8 bytes per struct, due to the 8-byte pointer. We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff will be that the hardware will have to use 3 8-byte loads instead of 2 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.

struct MY_ALIGN(16) {
  int a;
  int b
  int c
  int d
  float* i;
  float* j;
  float* k;
};

In this case with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines, or 8 on 64-bit machines. It would require two 16-byte loads (or 3 on a 64-bit machine). If we align to 8 bytes, we could eliminate waste entirely with 4-byte alignment (8-byte on 64-bit machines), but this would result in excessive loads. Again, tradeoffs.

or, how should a structure with 6 floats be aligned?

Again, tradeoffs: either waste 8 bytes per struct or require two loads per struct.

or 4 integers?

No tradeoff here. MY_ALIGN(16).

again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or _ device _ functions.

Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is important to worry about—it's another good reason to favor structures of arrays over arrays of structures.

einpoklum · Answer

These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and IIANM by nvcc as well. That should save you the need to resort to macros.

cuda memory alignment

Tags:

cuda

Panagiotis

2 Answers

harrism

einpoklum

Recent Activity

Donate For Us

cuda memory alignment

Tags:

cuda

Panagiotis

2 Answers

harrism

einpoklum

Related questions

Recent Activity

Donate For Us