How to use clang to compile OpenCL to ptx code?

Question

Clang 3.0 is able to compile OpenCL to ptx and use Nvidia's tool to launch the ptx code on GPU. How can I do this? Please be specific.

sschuberth · Accepted Answer

See Justin Holewinski's blog for a specific example or this thread for some more detailed steps and links to samples.

MrSlope · Answer

With the current version of of llvm(3.4), libclc and nvptx back-end, the compilation process has changed slightly.

You have to explicitly tell the nvptx backend which driver interface to use; your options are nvptx-nvidia-cuda or nvptx-nvidia-nvcl (for OpenCL) and their 64 bit equivalents nvptx64-nvidia-cuda or nvptx64-nvidia-nvcl.

The generated .ptx code differs slightly according to the chosen interface. In the assembly code produced for the CUDA driver API, intrinsics .global and .ptr are dropped from entry functions but they are required by OpenCL. I've modified Mikael's compile steps slightly to produce code that can be run with OpenCL host:

Compile to LLVM IR:

clang -Dcl_clang_storage_class_specifiers -isystem libclc/generic/include -include clc/clc.h -target nvptx64-nvidia-nvcl -xcl test.cl -emit-llvm -S -o test.ll

Link kernel:

llvm-link libclc/built_libs/nvptx64--nvidiacl.bc test.ll -o test.linked.bc

Compile to Ptx:

clang -target nvptx64-nvidia-nvcl  test.linked.bc -S -o test.nvptx.s

Mikael Lepistö · Answer

Here is brief guide how to do it with Clang trunk (3.4 at this point) and libclc. I assume you have basic knowledge how to configure and compile LLVM and Clang, so I just listed the configure flags I have used.

square.cl:

__kernel void vector_square(__global float4* input,  __global float4* output) {
  int i = get_global_id(0);
  output[i] = input[i]*input[i];
}

Compile llvm and clang with nvptx support:

../llvm-trunk/configure --prefix=$PWD/../install-trunk --enable-debug-runtime --enable-jit --enable-targets=x86,x86_64,nvptx
make install

Get libclc (git clone http://llvm.org/git/libclc.git) and compile it.

./configure.py --with-llvm-config=$PWD/../install-trunk/bin/llvm-config
make

If you have problem compiling this you might need to fix couple of headers in ./utils/prepare-builtins.cpp

-#include "llvm/Function.h"
-#include "llvm/GlobalVariable.h"
-#include "llvm/LLVMContext.h"
-#include "llvm/Module.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/GlobalVariable.h"
+#include "llvm/IR/LLVMContext.h"
+#include "llvm/IR/Module.h"

Compile kernel to LLVM IR assember:

clang -Dcl_clang_storage_class_specifiers -isystem libclc/generic/include -include clc/clc.h -target nvptx -xcl square.cl -emit-llvm -S -o square.ll

Link kernel with builtin implementations from libclc

llvm-link libclc/nvptx--nvidiacl/lib/builtins.bc square.ll -o square.linked.bc

Compile fully linked LLVM IR to PTX

clang -target nvptx square.linked.bc -S -o square.nvptx.s

square.nvptx.s:

    //
    // Generated by LLVM NVPTX Back-End
    //
    .version 3.1
    .target sm_20, texmode_independent
    .address_size 32

            // .globl       vector_square

    .entry vector_square(
            .param .u32 .ptr .global .align 16 vector_square_param_0,
            .param .u32 .ptr .global .align 16 vector_square_param_1
    )
    {
            .reg .pred %p<396>;
            .reg .s16 %rc<396>;
            .reg .s16 %rs<396>;
            .reg .s32 %r<396>;
            .reg .s64 %rl<396>;
            .reg .f32 %f<396>;
            .reg .f64 %fl<396>;

            ld.param.u32    %r0, [vector_square_param_0];
            mov.u32 %r1, %ctaid.x;
            ld.param.u32    %r2, [vector_square_param_1];
            mov.u32 %r3, %ntid.x;
            mov.u32 %r4, %tid.x;
            mad.lo.s32      %r1, %r3, %r1, %r4;
            shl.b32         %r1, %r1, 4;
            add.s32         %r0, %r0, %r1;
            ld.global.v4.f32        {%f0, %f1, %f2, %f3}, [%r0];
            mul.f32         %f0, %f0, %f0;
            mul.f32         %f1, %f1, %f1;
            mul.f32         %f2, %f2, %f2;
            mul.f32         %f3, %f3, %f3;
            add.s32         %r0, %r2, %r1;
            st.global.f32   [%r0+12], %f3;
            st.global.f32   [%r0+8], %f2;
            st.global.f32   [%r0+4], %f1;
            st.global.f32   [%r0], %f0;
            ret;
    }

How to use clang to compile OpenCL to ptx code?

Tags:

clang

opencl

dalibocai

3 Answers

sschuberth

MrSlope

Mikael Lepistö

Recent Activity

Donate For Us

How to use clang to compile OpenCL to ptx code?

Tags:

clang

opencl

dalibocai

3 Answers

sschuberth

MrSlope

Mikael Lepistö

Related questions

Recent Activity

Donate For Us