I want to use assembly code in CUDA C to speed up expensive operations, as we do with asm in C programming. Is this possible?


Is it possible to put assembly instructions into CUDA code?

2 Answers

Since CUDA 4.0, inline PTX is supported by the CUDA toolchain. There is a document in the toolkit that describes it: Using_Inline_PTX_Assembly_In_CUDA.pdf
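As a minimal sketch of what inline PTX looks like, the snippet below reads the PTX special register %laneid with a single asm statement. The names lane_id, record_lanes, and check_lane_ids are hypothetical, chosen only for illustration. The host helper assumes a CUDA-capable device with a warp size of 32 and a one-dimensional thread block.

```cuda
#include <cuda_runtime.h>

// Reads the PTX special register %laneid (the thread's index within its warp).
// A literal '%' must be written as "%%" when the asm statement has operands.
__device__ __forceinline__ unsigned int lane_id()
{
    unsigned int id;
    asm volatile ("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

// Each thread stores its lane ID at its own index.
__global__ void record_lanes(unsigned int *out)
{
    out[threadIdx.x] = lane_id();
}

// Hypothetical host helper: launches one block of 64 threads and checks that
// every lane ID equals threadIdx.x % 32 (assumes a warp size of 32).
bool check_lane_ids()
{
    const int n = 64;
    unsigned int host[n];
    unsigned int *dev = nullptr;

    if (cudaMalloc(&dev, n * sizeof(unsigned int)) != cudaSuccess) return false;
    record_lanes<<<1, n>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != static_cast<unsigned int>(i % 32)) return false;
    return true;
}
```

The constraint strings follow the familiar GCC-style syntax: "=r" marks a 32-bit register output and "r" a 32-bit register input.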

Below is some code demonstrating the use of inline PTX in CUDA 4.0. Note that this code should not be used as a replacement for CUDA's built-in __clz() function; I wrote it merely to explore aspects of the new inline PTX capability.

__device__ __forceinline__ int my_clz (unsigned int x)
{
    int res;

    asm ("{\n"
         "        .reg .pred iszero, gezero;\n"
         "        .reg .u32 t1, t2;\n"
         "        mov.b32         t1, %1;\n"
         "        shr.u32         %0, t1, 16;\n"
         "        setp.eq.b32     iszero, %0, 0;\n"
         "        mov.b32         %0, 0;\n"
         "@iszero shl.b32         t1, t1, 16;\n"
         "@iszero or.b32          %0, %0, 16;\n"
         "        and.b32         t2, t1, 0xff000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 8;\n"
         "@iszero or.b32          %0, %0, 8;\n"
         "        and.b32         t2, t1, 0xf0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 4;\n"
         "@iszero or.b32          %0, %0, 4;\n"
         "        and.b32         t2, t1, 0xc0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 2;\n"
         "@iszero or.b32          %0, %0, 2;\n"
         "        setp.ge.s32     gezero, t1, 0;\n"
         "        setp.eq.b32     iszero, t1, 0;\n"
         "@gezero or.b32          %0, %0, 1;\n"
         "@iszero add.u32         %0, %0, 1;\n\t"
         "}"
         : "=r"(res)
         : "r"(x));
    return res;
}
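One way to sanity-check a hand-written routine like this is to compare it against the built-in __clz() on a handful of inputs. The sketch below is an illustration rather than part of the original answer. To stay self-contained it uses a one-instruction PTX wrapper (ptx_clz, a hypothetical name built on the PTX clz.b32 instruction) in place of my_clz; swapping my_clz into the kernel should work the same way. The helper names compare_clz and count_clz_mismatches are likewise assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical wrapper around the PTX clz.b32 instruction.
__device__ __forceinline__ int ptx_clz(unsigned int x)
{
    unsigned int res;
    asm ("clz.b32 %0, %1;" : "=r"(res) : "r"(x));
    return static_cast<int>(res);
}

// Counts inputs for which the PTX wrapper disagrees with the built-in __clz().
__global__ void compare_clz(const unsigned int *in, int *mismatches, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && ptx_clz(in[i]) != __clz(static_cast<int>(in[i])))
        atomicAdd(mismatches, 1);
}

// Hypothetical host helper: returns the number of mismatches, or -1 on a CUDA error.
int count_clz_mismatches()
{
    const unsigned int host_in[] = {0u, 1u, 2u, 0x8000u, 0x00ffffffu,
                                    0x80000000u, 0xffffffffu};
    const int n = sizeof(host_in) / sizeof(host_in[0]);
    unsigned int *dev_in = nullptr;
    int *dev_bad = nullptr;
    int bad = -1;

    if (cudaMalloc(&dev_in, sizeof(host_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&dev_bad, sizeof(int)) != cudaSuccess) {
        cudaFree(dev_in);
        return -1;
    }
    cudaMemcpy(dev_in, host_in, sizeof(host_in), cudaMemcpyHostToDevice);
    cudaMemset(dev_bad, 0, sizeof(int));

    compare_clz<<<1, 32>>>(dev_in, dev_bad, n);

    if (cudaMemcpy(&bad, dev_bad, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        bad = -1;
    cudaFree(dev_in);
    cudaFree(dev_bad);
    return bad;
}
```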


answered Sep 27 '22 23:09

njuffa

No, you can't; there is nothing like the asm constructs from C/C++. What you can do is tweak the generated PTX assembly and then use it with CUDA.

See this for an example.

But for GPUs, assembly optimizations are NOT necessary; you should do other optimizations first, such as improving memory coalescing and occupancy. See the CUDA Best Practices Guide for more information.
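To make the two optimizations named above concrete, here is a rough sketch, not taken from the answer. It contrasts a coalesced copy kernel with a strided one and asks the runtime for an occupancy-friendly block size via cudaOccupancyMaxPotentialBlockSize. The kernel and helper names (copy_coalesced, copy_strided, check_copy_and_occupancy), the array size, and the stride of 32 are all assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Coalesced: consecutive threads access consecutive elements, so each warp
// touches a small number of contiguous memory segments.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read elements `stride` apart, so each warp
// touches many separate memory segments. Only the first n / stride outputs
// are written.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}

// Hypothetical host helper: runs both kernels and verifies the coalesced copy.
bool check_copy_and_occupancy()
{
    const int n = 1 << 20;
    const int stride = 32;   // assumed stride, chosen only for illustration
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return false;
    }
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Ask the runtime for a block size that maximizes theoretical occupancy.
    int min_grid = 0, block = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, copy_coalesced);
    if (block <= 0) block = 256;   // fallback if the query failed
    int grid = (n + block - 1) / block;

    copy_strided<<<grid, block>>>(in, out, n, stride);
    copy_coalesced<<<grid, block>>>(in, out, n);   // overwrites the strided result

    std::vector<float> result(n);
    cudaError_t err = cudaMemcpy(result.data(), out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (result[i] != host[i]) return false;
    return true;
}
```

Timing the two kernels with a profiler is one way to see how much the access pattern alone can matter before reaching for hand-written assembly.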

answered Sep 27 '22 23:09

Dr. Snoopy

Tags:

c

assembly

cuda

inline-assembly

ptx

superscalar
