 

What is the difference between __ldg() intrinsic and a normal execution?

Tags:

cuda

I am trying to explore the __ldg() intrinsic. I have gone through NVIDIA's documentation on it, but didn't get a satisfactory answer about its use and implementation. Moreover, with reference to THIS, I tried implementing __ldg in a simple 1024*1024 matrix multiplication example.

#include<stdio.h>
#include<stdlib.h>

__global__ void matrix_mul(float *ad, float *bd, float *cd, int N)
{
    float pvalue = 0;
    // find the row and column corresponding to this thread's data element
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    // dot product of a row of the first matrix and a column of the second
    for (int i = 0; i < N; ++i)
    {
        // I tried executing this first:
        float m = __ldg(&ad[Row * N + i]);
        float n = __ldg(&bd[i * N + Col]);

        // Then I executed this as a normal execution:
        // float m = ad[Row * N + i];
        // float n = bd[i * N + Col];

        pvalue += m * n;
    }
    // store the dot product at the corresponding position in the result matrix
    cd[Row * N + Col] = pvalue;
}

int main()
{
    int N = 1024,i,j;               //N == size of square matrix

    float *a,*b;
    float *ad,*bd,*cd,*c;

    //open a file for outputting the result
    FILE *f;
    f=fopen("Parallel Multiply_ldg.txt","w");

    size_t size=sizeof(float)* N * N;

    //allocate host side memory
    a=(float*)malloc(size);
    b=(float*)malloc(size);
    c=(float*)malloc(size);

    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i*N+j]=2.0;   //(float)(i*N+j);       //initializing each value with its own index
            b[i*N+j]=1.0;   //(float)(i*N+j);       //random functions can be used alternatively
        }
    }

    //allocate device memory
    cudaMalloc(&ad,size);
    //printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&bd,size);
    //printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&cd,size);
    //printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //copy value from host to device
    cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
    cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);

    printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //calculate execution configuration
    dim3 blocksize(16,16);              //each block contains 16 * 16 (=256) threads
    dim3 gridsize(N/16,N/16);           //creating just sufficient no of blocks

    //GPU timer code
    float time;
    cudaEvent_t start,stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);

    matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
    cudaDeviceSynchronize();
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop);         //time taken in kernel call calculated
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    //copy back results
    cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);

    printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //output results in output_file
    fprintf(f,"Array A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",a[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nArray B was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",b[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nMultiplication of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",c[i*N+j]);              //if correctly computed, then all values must be N
        fprintf(f,"\n");
    }
    printf("\nYou can see output in Parallel Mutiply.txt file in project directory");
    printf("\n\nTime taken is %f (ms)\n",time);
    fprintf(f,"\n\nTime taken is %f (ms)\n",time);
    fclose(f);

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    free(a); free(b); free(c);
    cudaDeviceReset();          // cudaThreadExit() is deprecated
    //_getch();
    return 0;
}

I commented out the __ldg part in my kernel and ran it as a normal execution, and vice versa. In both cases it gives me the correct multiplication result. What confuses me is the time difference between these executions, because it's huge, almost more than 100X!

In case of __ldg it gives me: Time taken is 0.014432 (ms)

And in case of normal execution without __ldg it gives me: Time taken is 36.858398 (ms)

Is this the correct way of using the __ldg intrinsic? What is the significance of the __ldg intrinsic, and what is the proper way of using it? Apparently what I did above in my code is wrong and naive. I am looking for an explanation and an example. Thanks in advance.

asked Oct 28 '14 by sandeep.ganage


2 Answers

From the CUDA C Programming Guide

Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.

...

Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.

Read-only cache accesses have a much lower latency than global memory accesses. Because matrix multiplication reads the same values from memory many times, caching them in the read-only cache gives a huge speedup (in memory-bound applications).
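As a concrete illustration, here is a minimal sketch of the questioner's kernel rewritten with the qualifier-based approach (same shape as the code above; the compiler is then free to emit the read-only loads itself):

__global__ void matrix_mul(const float* __restrict__ ad,
                           const float* __restrict__ bd,
                           float* __restrict__ cd, int N)
{
    float pvalue = 0;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < N; ++i)
    {
        // No explicit __ldg(): const + __restrict__ tell the compiler the
        // inputs are read-only and non-aliased, so it can route these
        // loads through the read-only data cache on its own.
        pvalue += ad[Row * N + i] * bd[i * N + Col];
    }
    cd[Row * N + Col] = pvalue;
}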

answered Sep 23 '22 by Avi Ginsburg


NVIDIA GPUs have texture hardware: a pipeline with dedicated (but not complicated) logic originally designed for working with images.

This texture memory is another type of memory available on the GPU. In particular, constant memory, global memory, and the register file have no relation to it.

Kepler GPUs and later add the ability to use this memory through the GPU texture pipeline as a general-purpose read-only data cache.

But let's spell out the difference between the constant cache and the read-only cache.

Constant Cache

Data loaded through the constant cache must be relatively small, and it must be accessed in such a way that all threads of a warp access the same location at any given time.

Read-only Cache or Texture Memory Cache

The read-only cache can be much larger and can be accessed in a non-uniform pattern. Its load granularity is 32 bytes.
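To make the contrast concrete, here is a minimal sketch (hypothetical kernels and table size, purely for illustration) of the access pattern each cache is built for:

// Constant cache: a small table where every thread of a warp reads the
// SAME location at the same time, so the value is broadcast to the warp.
__constant__ float table[64];

__global__ void broadcast_read(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[blockIdx.x % 64];   // uniform within the warp
}

// Read-only cache: larger data, and each thread may read a DIFFERENT
// location; non-uniform access is fine here.
__global__ void scattered_read(float *out, const float* __restrict__ in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[(i * 7) % n]);  // scattered reads
}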


You can use this read-only cache for your CUDA kernel:

1. Data stored in global memory can be cached in this GPU texture memory.
2. By doing that, you promise the compiler that the data is read-only for the
   duration of the kernel's execution on the GPU.

There are two ways to achieve this.

A. Using the intrinsic function __ldg

Example: output[i] += __ldg(&input[j]);

B. Qualifying pointers to global memory

Example: with the parameter declared as const float* __restrict__ input,

    output[i] += input[j];

Comparison:

The intrinsic __ldg is the more reliable choice: const and __restrict__ are only hints, and, as the Programming Guide quoted above notes, the compiler may not always be able to prove the read-only condition, whereas __ldg forces the load through the read-only cache.
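A minimal side-by-side sketch of the two forms (a toy copy kernel with placeholder names in/out, assuming compute capability 3.5 or later):

// Form A: explicit intrinsic. The load goes through the read-only
// cache regardless of whether the compiler can prove anything.
__global__ void copy_ldg(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);
}

// Form B: qualifiers only. The compiler may emit the read-only load
// itself once const + __restrict__ rule out aliasing with out.
__global__ void copy_qualified(float* __restrict__ out,
                               const float* __restrict__ in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}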

answered Sep 24 '22 by Konstantin Burlachenko