Is inline PTX assembly code powerful?

Tags:

cuda

I have seen code samples where people use inline PTX assembly in C code. The CUDA toolkit documentation says that PTX is powerful. Why is that? What advantage do we gain by using such code in our C code?

username_4567 asked Dec 27 '22 17:12


1 Answer

Inline PTX gives you access to instructions that are not exposed via CUDA intrinsics, and lets you apply optimizations that are either missing from the compiler or prohibited by the language specification. For a worked example where inline PTX is advantageous, see: 128 bit integer on cuda?

The 128-bit addition using inline PTX requires just four instructions, because PTX has direct access to the carry flag. As a high-level language (HLL), C/C++ has no representation for a carry flag, since a given hardware platform may have no carry flag (e.g. MIPS), a single carry flag (e.g. x86, sm_2x), or even multiple carry flags. In contrast to the 4-instruction PTX versions of 128-bit addition and subtraction, these operations might be coded in C as follows:

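/* Naming follows the PTX mnemonics: a trailing "cc" writes the carry (or
   borrow) out to cy, and a "C" after ADD/SUB reads cy as carry-in.
   Each macro evaluates to the 32-bit result; t0..t2 are scratch variables. */
#define SUBCcc(a,b,cy,t0,t1,t2) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t2=t1<t0, cy=cy+t2, t1-t0)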
#define SUBcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), cy=t1<t0, t1-t0)
#define SUBC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t1-t0)
#define ADDCcc(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t0+t1)

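/* cy carries the carry or borrow from one 32-bit word to the next.
   The operands are assumed to be four-word vectors (e.g. uint4) with the
   least significant word in .x. */
unsigned int cy, t0, t1, t2;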

res.x = ADDcc  (augend.x, addend.x, cy, t0, t1);
res.y = ADDCcc (augend.y, addend.y, cy, t0, t1);
res.z = ADDCcc (augend.z, addend.z, cy, t0, t1);
res.w = ADDC   (augend.w, addend.w, cy, t0, t1);

res.x = SUBcc  (minuend.x, subtrahend.x, cy, t0, t1);
res.y = SUBCcc (minuend.y, subtrahend.y, cy, t0, t1, t2);
res.z = SUBCcc (minuend.z, subtrahend.z, cy, t0, t1, t2);
res.w = SUBC   (minuend.w, subtrahend.w, cy, t0, t1);

The addition and subtraction above probably compile to about three to four times the number of instructions used by the corresponding inline PTX version.
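For comparison, the four-instruction PTX addition mentioned above can be sketched roughly as follows. This is only an illustration, not the code from the linked question: the names HD, u128, add_u128 and check_add_u128 are made up for this example, and the operand layout (four 32-bit words, least significant in x) is an assumption. The PTX path uses the add.cc / addc.cc / addc instructions and is only compiled for device code under nvcc. A plain C compiler takes the fallback path instead, which recovers the carry from a 64-bit sum.

```c
#include <assert.h>
#include <stdint.h>

/* All names here (HD, u128, add_u128, check_add_u128) are illustrative. */
#ifdef __CUDACC__
#define HD __host__ __device__   /* compiled by nvcc */
#else
#define HD                       /* compiled as plain C */
#endif

/* 128-bit value as four 32-bit words, least significant in x (assumed layout). */
typedef struct { uint32_t x, y, z, w; } u128;

HD u128 add_u128(u128 a, u128 b)
{
    u128 r;
#ifdef __CUDA_ARCH__
    /* Device code: four PTX instructions chained through the carry flag. */
    asm ("add.cc.u32   %0, %4, %8;\n\t"
         "addc.cc.u32  %1, %5, %9;\n\t"
         "addc.cc.u32  %2, %6, %10;\n\t"
         "addc.u32     %3, %7, %11;\n\t"
         : "=r"(r.x), "=r"(r.y), "=r"(r.z), "=r"(r.w)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
#else
    /* Host or plain C: no carry flag, so the carry is the high half of a 64-bit sum. */
    uint64_t s;
    s = (uint64_t)a.x + b.x;             r.x = (uint32_t)s;
    s = (uint64_t)a.y + b.y + (s >> 32); r.y = (uint32_t)s;
    s = (uint64_t)a.z + b.z + (s >> 32); r.z = (uint32_t)s;
    s = (uint64_t)a.w + b.w + (s >> 32); r.w = (uint32_t)s;
#endif
    return r;
}

/* Host-side check: (2^64 - 1) + 1 must carry out of the two low words. */
void check_add_u128(void)
{
    u128 a = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u };
    u128 b = { 1u, 0u, 0u, 0u };
    u128 r = add_u128(a, b);
    assert(r.x == 0u && r.y == 0u && r.z == 1u && r.w == 0u);
}
```

Subtraction would presumably follow the same pattern with sub.cc.u32, subc.cc.u32 and subc.u32. Note that check_add_u128 only exercises the host fallback. The PTX path would need a kernel launch to be tested.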

njuffa answered Dec 29 '22 06:12