Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strange behavior when detecting global memory

Tags:

cuda

After reading this question: "How to differentiate between pointers to shared and global memory?", I decided to try isspacep.local, isspacep.global and isspacep.shared in a simple test program.

The tests for local and shared memory work all the time, however the global memory test does not always work, for instance it did when the device code was compiled in debug mode (-G).

At first I thought that the compiler detected that I used a dummy vector for the global memory and handled it differently, so I used -Xcicc -O0 -Xptxas -O0 (cf. "Completely disable optimizations on NVCC"). If I compute with sm_30, global memory is detected correctly. However, if I compute with sm_20 or sm_21, global memory is not detected as such. Note that with -G, any sm >= 20 works.

Is there something that I am missing here? Is there an additional flag given to the compiler when using -G that could explain these differences?

Compilation

nvcc test_pointer.cu -arch=sm_20 -Xcicc -O0 -Xptxas -O0 -Xptxas -v -o test_pointer

Code

#include <stdio.h>
#include <cuda.h>

#define CUDA_CHECK_ERROR()  __cuda_check_errors(__FILE__, __LINE__)
#define CUDA_SAFE_CALL(err) __cuda_safe_call(err, __FILE__, __LINE__)

inline void __cuda_check_errors(const char *filename, const int line_number)
{
    cudaError err = cudaDeviceSynchronize();
    if(err != cudaSuccess)
    {
        printf("CUDA error %i at %s:%i: %s\n",
               err, filename, line_number, cudaGetErrorString(err));
        exit(-1);
    }
}

inline void __cuda_safe_call(cudaError err, const char *filename, const int line_number)
{
    if (err != cudaSuccess)
    {
        printf("CUDA error %i at %s:%i: %s\n",
               err, filename, line_number, cudaGetErrorString(err));
        exit(-1);
    }
}

__device__ unsigned int __isLocal(const void *ptr)
{
  unsigned int ret;
  asm volatile ("{ \n\t"
                "    .reg .pred p; \n\t"
                "    isspacep.local p, %1; \n\t"
                "    selp.u32 %0, 1, 0, p;  \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
                "} \n\t" : "=r"(ret) : "l"(ptr));
#else
                "} \n\t" : "=r"(ret) : "r"(ptr));
#endif

  return ret;
}

__device__ unsigned int __isShared(const void *ptr)
{
  unsigned int ret;
  asm volatile ("{ \n\t"
                "    .reg .pred p; \n\t"
                "    isspacep.shared p, %1; \n\t"
                "    selp.u32 %0, 1, 0, p;  \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
                "} \n\t" : "=r"(ret) : "l"(ptr));
#else
                "} \n\t" : "=r"(ret) : "r"(ptr));
#endif

  return ret;
}

__device__ void analyze_pointer(const void *ptr)
{
    printf("\t* is local:  %u\n", __isLocal(ptr));
    printf("\t* is global: %u\n", __isGlobal(ptr));
    printf("\t* is shared: %u\n", __isShared(ptr));
}

template <typename T, unsigned int N>
__global__ void test_kernel(T *vec)
{
    // Shared array
    __shared__ T shared_vec[10];

    // Register array
    T reg[10];

    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        printf("Register array:\n");
        analyze_pointer(&reg);

        printf("\nGlobal array:\n");
        analyze_pointer(vec);

        printf("\nShared array:\n");
        analyze_pointer(&shared_vec);
    }
}

int main()
{
    typedef float type_t;
    const unsigned int N = 128;

    type_t* d_vec;

    CUDA_SAFE_CALL(cudaMalloc(&d_vec, N * sizeof(type_t)));

    test_kernel<type_t, N><<<1, N>>>(d_vec);
    CUDA_CHECK_ERROR();

    CUDA_SAFE_CALL(cudaFree(d_vec));
}

Output

Register array:
    * is local:  1
    * is global: 0
    * is shared: 0

Global array:
    * is local:  0
    * is global: 0 (or 1 with -G or sm_30)
    * is shared: 0

Shared array:
    * is local:  0
    * is global: 0
    * is shared: 1

Hardware/software properties

This was tested with CUDA 5.0, GeForce GT 650M (CC 3.0), drivers 319.17 on Arch Linux 64-bit.

UPDATE #1

I just tested this code with a Tesla C2070 (CC 2.0) with the 304.88 drivers, CUDA 5.0 on Linux 64-bit, and it works. Global memory is detected when optimization is turned off, i.e. -arch=sm_20 -Xcicc -O0, or when an extra printf("\t* ptr = %ld\n", ptr); is added (cf. @RobertCrovella's comment). It does sound like a driver issue.

UPDATE #2

I made some more tests and here is what I get with my CC 3.0 device depending on how I compile the program:

-arch=sm_30                                               ---> undetected (probably optimized)
-arch=sm_30 -Xcicc -O0 -Xptxas -O0                        ---> OK
-arch=sm_30 -G                                            ---> OK
-arch=compute_30 -code=sm_30 -Xcicc -O0 -Xptxas -O0       ---> OK
-arch=compute_30 -code=sm_30 -G                           ---> OK
-arch=compute_30 -code=compute_30 -Xcicc -O0 -Xptxas -O0  ---> undetected
-arch=compute_30 -code=compute_30 -G                      ---> OK
-arch=sm_20                                               ---> undetected
-arch=sm_20 -Xcicc -O0 -Xptxas -O0                        ---> undetected
-arch=sm_20 -G                                            ---> OK
-arch=compute_20 -Xcicc -O0 -Xptxas -O0                   ---> undetected
-arch=compute_20 -G                                       ---> OK
-arch=compute_20 -code=sm_20 -Xcicc -O0 -Xptxas -O0       ---> runtime error (as expected)
-arch=compute_20 -code=sm_20 -G                           ---> runtime error (as expected)
-arch=compute_20 -code=compute_20 -Xcicc -O0 -Xptxas -O0  ---> undetected
-arch=compute_20 -code=compute_20 -G                      ---> OK
-arch=compute_20 -code=sm_30                              ---> undetected (probably optimized)
-arch=compute_20 -code=sm_30 -Xcicc -O0 -Xptxas -O0       ---> OK
-arch=compute_20 -code=sm_30 -G                           ---> OK
like image 786
BenC Avatar asked Nov 13 '22 05:11

BenC


1 Answers

This was apparently a bug in CUDA and the fix should be released with CUDA 6.0.

like image 156
BenC Avatar answered Dec 25 '22 14:12

BenC