[Please look at the edit below; the solution to the question may simply be there.]
I'm trying to learn OpenCL through the study of a small ray tracer (see the code below, from this link).
I don't have a "real" GPU; I'm currently on a macOS laptop with an Intel(R) Iris(TM) Graphics 6100 graphics card.
The code works well on the CPU, but its behavior on the GPU is strange: whether it works or not depends on the number of samples per pixel (the number of rays shot through each pixel to compute its color after propagating them through the scene). With a small number of samples (64) I can render a 1280x720 picture, but with 128 samples I can only get a smaller picture. As I understand it, the number of samples should only affect the rendering time and the quality of the picture, not whether the render succeeds. Is there something purely related to OpenCL/GPU that I'm missing?
Moreover, it seems to be the read-back of the results from GPU memory that crashes:
queue.enqueueReadBuffer(cl_output, CL_TRUE, 0, image_width * image_height * sizeof(cl_float4), cpu_output);
I get an "Abort trap: 6" at this stage.
I'm missing something.
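For reference, cl.hpp calls return a cl_int status when exceptions are not enabled, so a minimal way to narrow down which call actually fails would be something like this (checkErr is just a hypothetical helper, not part of the code below):
// hypothetical helper to locate the failing call
void checkErr(cl_int err, const char* where){
    if (err != CL_SUCCESS){ std::cerr << "OpenCL error " << err << " at " << where << std::endl; exit(1); }
}
// ...
checkErr(queue.enqueueNDRangeKernel(kernel, cl::NullRange, global_work_size, local_work_size), "enqueueNDRangeKernel");
checkErr(queue.finish(), "finish");
checkErr(queue.enqueueReadBuffer(cl_output, CL_TRUE, 0, image_width * image_height * sizeof(cl_float3), cpu_output), "enqueueReadBuffer");
If a kernel launch was killed by the driver, the following calls usually fail with an error code (for example CL_OUT_OF_RESOURCES on some drivers) rather than returning valid data.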
[EDIT] After some research I found an interesting trail: the graphics card may deliberately abort the task because it takes too long. This behavior would have been put in place to avoid a "frozen" screen. This topic talks about that.
What do you think about that?
I can't find a way to turn this behavior off. Do you know how to do it?
Here are the files:
main.cpp:
// OpenCL based simple sphere path tracer by Sam Lapere, 2016
// based on smallpt by Kevin Beason
// http://raytracey.blogspot.com
#include <iostream>
#include <fstream>
#include <vector>
#include <CL/cl.hpp>
using namespace std;
using namespace cl;
const int image_width = 1280;
const int image_height = 720;
cl_float4* cpu_output;
CommandQueue queue;
Device device;
Kernel kernel;
Context context;
Program program;
Buffer cl_output;
Buffer cl_spheres;
// dummy variables are required for memory alignment
// float3 is considered as float4 by OpenCL
struct Sphere
{
cl_float radius;
cl_float dummy1;
cl_float dummy2;
cl_float dummy3;
cl_float3 position;
cl_float3 color;
cl_float3 emission;
};
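// (added sanity check, not part of the original code, requires C++11: the device-side
// Sphere in opencl_kernel.cl occupies 4 * 16 bytes because float3 is stored as float4,
// so the host-side struct must be exactly 64 bytes for the buffer copy to line up)
static_assert(sizeof(Sphere) == 4 * sizeof(cl_float4),
"host-side Sphere no longer matches the 64-byte device-side layout");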
void pickPlatform(Platform& platform, const vector<Platform>& platforms){
if (platforms.size() == 1) platform = platforms[0];
else{
int input = 0;
cout << "\nChoose an OpenCL platform: ";
cin >> input;
// handle incorrect user input
while (input < 1 || input > platforms.size()){
cin.clear(); //clear errors/bad flags on cin
cin.ignore(cin.rdbuf()->in_avail(), '\n'); // ignores exact number of chars in cin buffer
cout << "No such option. Choose an OpenCL platform: ";
cin >> input;
}
platform = platforms[input - 1];
}
}
void pickDevice(Device& device, const vector<Device>& devices){
if (devices.size() == 1) device = devices[0];
else{
int input = 0;
cout << "\nChoose an OpenCL device: ";
cin >> input;
// handle incorrect user input
while (input < 1 || input > devices.size()){
cin.clear(); //clear errors/bad flags on cin
cin.ignore(cin.rdbuf()->in_avail(), '\n'); // ignores exact number of chars in cin buffer
cout << "No such option. Choose an OpenCL device: ";
cin >> input;
}
device = devices[input - 1];
}
}
void printErrorLog(const Program& program, const Device& device){
// Get the error log and print to console
string buildlog = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device);
cerr << "Build log:" << std::endl << buildlog << std::endl;
// Print the error log to a file
FILE *log = fopen("errorlog.txt", "w");
fprintf(log, "%s\n", buildlog.c_str());
fclose(log);
cout << "Error log saved in 'errorlog.txt'" << endl;
system("PAUSE");
exit(1);
}
void initOpenCL()
{
// Get all available OpenCL platforms (e.g. AMD OpenCL, Nvidia CUDA, Intel OpenCL)
vector<Platform> platforms;
Platform::get(&platforms);
cout << "Available OpenCL platforms : " << endl << endl;
for (int i = 0; i < platforms.size(); i++)
cout << "\t" << i + 1 << ": " << platforms[i].getInfo<CL_PLATFORM_NAME>() << endl;
// Pick one platform
Platform platform;
pickPlatform(platform, platforms);
cout << "\nUsing OpenCL platform: \t" << platform.getInfo<CL_PLATFORM_NAME>() << endl;
// Get available OpenCL devices on platform
vector<Device> devices;
platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
cout << "Available OpenCL devices on this platform: " << endl << endl;
for (int i = 0; i < devices.size(); i++){
cout << "\t" << i + 1 << ": " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
cout << "\t\tMax compute units: " << devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() << endl;
cout << "\t\tMax work group size: " << devices[i].getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>() << endl << endl;
}
// Pick one device
pickDevice(device, devices);
cout << "\nUsing OpenCL device: \t" << device.getInfo<CL_DEVICE_NAME>() << endl;
cout << "\t\t\tMax compute units: " << device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() << endl;
cout << "\t\t\tMax work group size: " << device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>() << endl;
// Create an OpenCL context and command queue on that device.
context = Context(device);
queue = CommandQueue(context, device);
// Convert the OpenCL source code to a string
string source;
ifstream file("opencl_kernel.cl");
if (!file){
cout << "\nNo OpenCL file found!" << endl << "Exiting..." << endl;
system("PAUSE");
exit(1);
}
while (!file.eof()){
char line[256];
file.getline(line, 255);
source += line;
source += "\n"; // keep line breaks so the kernel source retains its structure
}
const char* kernel_source = source.c_str();
// Create an OpenCL program by performing runtime source compilation for the chosen device
program = Program(context, kernel_source);
cl_int result = program.build({ device });
if (result) cout << "Error during compilation of OpenCL code! (" << result << ")" << endl;
if (result == CL_BUILD_PROGRAM_FAILURE) printErrorLog(program, device);
// Create a kernel (entry point in the OpenCL source program)
kernel = Kernel(program, "render_kernel");
}
void cleanUp(){
delete[] cpu_output;
}
inline float clamp(float x){ return x < 0.0f ? 0.0f : x > 1.0f ? 1.0f : x; }
// convert RGB float in range [0,1] to int in range [0, 255] and perform gamma correction
inline int toInt(float x){ return int(clamp(x) * 255 + .5); }
void saveImage(){
// write image to PPM file, a very simple image file format
// PPM files can be opened with IrfanView (download at www.irfanview.com) or GIMP
FILE *f = fopen("opencl_raytracer.ppm", "w");
fprintf(f, "P3\n%d %d\n%d\n", image_width, image_height, 255);
// loop over all pixels, write RGB values
for (int i = 0; i < image_width * image_height; i++)
fprintf(f, "%d %d %d ",
toInt(cpu_output[i].s[0]),
toInt(cpu_output[i].s[1]),
toInt(cpu_output[i].s[2]));
fclose(f);
}
#define float3(x, y, z) {{x, y, z}} // macro to replace ugly initializer braces
void initScene(Sphere* cpu_spheres){
// left wall
cpu_spheres[0].radius = 200.0f;
cpu_spheres[0].position = float3(-200.6f, 0.0f, 0.0f);
cpu_spheres[0].color = float3(0.75f, 0.25f, 0.25f);
cpu_spheres[0].emission = float3(0.0f, 0.0f, 0.0f);
// right wall
cpu_spheres[1].radius = 200.0f;
cpu_spheres[1].position = float3(200.6f, 0.0f, 0.0f);
cpu_spheres[1].color = float3(0.25f, 0.25f, 0.75f);
cpu_spheres[1].emission = float3(0.0f, 0.0f, 0.0f);
// floor
cpu_spheres[2].radius = 200.0f;
cpu_spheres[2].position = float3(0.0f, -200.4f, 0.0f);
cpu_spheres[2].color = float3(0.9f, 0.8f, 0.7f);
cpu_spheres[2].emission = float3(0.0f, 0.0f, 0.0f);
// ceiling
cpu_spheres[3].radius = 200.0f;
cpu_spheres[3].position = float3(0.0f, 200.4f, 0.0f);
cpu_spheres[3].color = float3(0.9f, 0.8f, 0.7f);
cpu_spheres[3].emission = float3(0.0f, 0.0f, 0.0f);
// back wall
cpu_spheres[4].radius = 200.0f;
cpu_spheres[4].position = float3(0.0f, 0.0f, -200.4f);
cpu_spheres[4].color = float3(0.9f, 0.8f, 0.7f);
cpu_spheres[4].emission = float3(0.0f, 0.0f, 0.0f);
// front wall
cpu_spheres[5].radius = 200.0f;
cpu_spheres[5].position = float3(0.0f, 0.0f, 202.0f);
cpu_spheres[5].color = float3(0.9f, 0.8f, 0.7f);
cpu_spheres[5].emission = float3(0.0f, 0.0f, 0.0f);
// left sphere
cpu_spheres[6].radius = 0.16f;
cpu_spheres[6].position = float3(-0.25f, -0.24f, -0.1f);
cpu_spheres[6].color = float3(0.9f, 0.8f, 0.7f);
cpu_spheres[6].emission = float3(0.0f, 0.0f, 0.0f);
// right sphere
cpu_spheres[7].radius = 0.16f;
cpu_spheres[7].position = float3(0.25f, -0.24f, 0.1f);
cpu_spheres[7].color = float3(0.9f, 0.8f, 0.7f);
cpu_spheres[7].emission = float3(0.0f, 0.0f, 0.0f);
// lightsource
cpu_spheres[8].radius = 1.0f;
cpu_spheres[8].position = float3(0.0f, 1.36f, 0.0f);
cpu_spheres[8].color = float3(0.0f, 0.0f, 0.0f);
cpu_spheres[8].emission = float3(9.0f, 8.0f, 6.0f);
}
int main(){
// initialise OpenCL
initOpenCL();
// allocate memory on CPU to hold the rendered image
cpu_output = new cl_float3[image_width * image_height];
// initialise scene
const int sphere_count = 9;
Sphere cpu_spheres[sphere_count];
initScene(cpu_spheres);
// Create buffers on the OpenCL device for the image and the scene
cl_output = Buffer(context, CL_MEM_WRITE_ONLY, image_width * image_height * sizeof(cl_float3));
cl_spheres = Buffer(context, CL_MEM_READ_ONLY, sphere_count * sizeof(Sphere));
queue.enqueueWriteBuffer(cl_spheres, CL_TRUE, 0, sphere_count * sizeof(Sphere), cpu_spheres);
// specify OpenCL kernel arguments
kernel.setArg(0, cl_spheres);
kernel.setArg(1, image_width);
kernel.setArg(2, image_height);
kernel.setArg(3, sphere_count);
kernel.setArg(4, cl_output);
// every pixel in the image has its own thread or "work item",
// so the total amount of work items equals the number of pixels
std::size_t global_work_size = image_width * image_height;
std::size_t local_work_size = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
cout << "Kernel work group size: " << local_work_size << endl;
// Ensure the global work size is a multiple of local work size
if (global_work_size % local_work_size != 0)
global_work_size = (global_work_size / local_work_size + 1) * local_work_size;
cout << "Rendering started..." << endl;
// launch the kernel
queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, local_work_size);
queue.finish();
cout << "Rendering done! \nCopying output from device to host" << endl;
// read and copy OpenCL output to CPU
queue.enqueueReadBuffer(cl_output, CL_TRUE, 0, image_width * image_height * sizeof(cl_float3), cpu_output);
// save image
saveImage();
cout << "Saved image to 'opencl_raytracer.ppm'" << endl;
// release memory
cleanUp();
system("PAUSE");
}
opencl_kernel.cl:
/* OpenCL based simple sphere path tracer by Sam Lapere, 2016*/
/* based on smallpt by Kevin Beason */
/* http://raytracey.blogspot.com */
__constant float EPSILON = 0.00003f; /* required to compensate for limited float precision */
__constant float PI = 3.14159265359f;
__constant int SAMPLES = 128;
typedef struct Ray{
float3 origin;
float3 dir;
} Ray;
typedef struct Sphere{
float radius;
float3 pos;
float3 color;
float3 emission;
} Sphere;
static float get_random(unsigned int *seed0, unsigned int *seed1) {
/* hash the seeds using bitwise AND operations and bitshifts */
*seed0 = 36969 * ((*seed0) & 65535) + ((*seed0) >> 16);
*seed1 = 18000 * ((*seed1) & 65535) + ((*seed1) >> 16);
unsigned int ires = ((*seed0) << 16) + (*seed1);
/* use union struct to convert int to float */
union {
float f;
unsigned int ui;
} res;
res.ui = (ires & 0x007fffff) | 0x40000000; /* bitwise AND, bitwise OR */
return (res.f - 2.0f) / 2.0f;
}
Ray createCamRay(const int x_coord, const int y_coord, const int width, const int height){
float fx = (float)x_coord / (float)width; /* convert int in range [0 - width] to float in range [0-1] */
float fy = (float)y_coord / (float)height; /* convert int in range [0 - height] to float in range [0-1] */
/* calculate aspect ratio */
float aspect_ratio = (float)(width) / (float)(height);
float fx2 = (fx - 0.5f) * aspect_ratio;
float fy2 = fy - 0.5f;
/* determine position of pixel on screen */
float3 pixel_pos = (float3)(fx2, -fy2, 0.0f);
/* create camera ray*/
Ray ray;
ray.origin = (float3)(0.0f, 0.1f, 2.0f); /* fixed camera position */
ray.dir = normalize(pixel_pos - ray.origin); /* vector from camera to pixel on screen */
return ray;
}
/* (__global Sphere* sphere, const Ray* ray) */
float intersect_sphere(const Sphere* sphere, const Ray* ray) /* version using local copy of sphere */
{
float3 rayToCenter = sphere->pos - ray->origin;
float b = dot(rayToCenter, ray->dir);
float c = dot(rayToCenter, rayToCenter) - sphere->radius*sphere->radius;
float disc = b * b - c;
if (disc < 0.0f) return 0.0f;
else disc = sqrt(disc);
if ((b - disc) > EPSILON) return b - disc;
if ((b + disc) > EPSILON) return b + disc;
return 0.0f;
}
bool intersect_scene(__constant Sphere* spheres, const Ray* ray, float* t, int* sphere_id, const int sphere_count)
{
/* initialise t to a very large number,
so t will be guaranteed to be smaller
when a hit with the scene occurs */
float inf = 1e20f;
*t = inf;
/* check if the ray intersects each sphere in the scene */
for (int i = 0; i < sphere_count; i++) {
Sphere sphere = spheres[i]; /* create local copy of sphere */
/* float hitdistance = intersect_sphere(&spheres[i], ray); */
float hitdistance = intersect_sphere(&sphere, ray);
/* keep track of the closest intersection and hitobject found so far */
if (hitdistance != 0.0f && hitdistance < *t) {
*t = hitdistance;
*sphere_id = i;
}
}
return *t < inf; /* true when ray intersects the scene */
}
/* the path tracing function */
/* computes a path (starting from the camera) with a defined number of bounces, accumulates light/color at each bounce */
/* each ray hitting a surface will be reflected in a random direction (by randomly sampling the hemisphere above the hitpoint) */
/* small optimisation: diffuse ray directions are calculated using cosine weighted importance sampling */
float3 trace(__constant Sphere* spheres, const Ray* camray, const int sphere_count, const int* seed0, const int* seed1){
Ray ray = *camray;
float3 accum_color = (float3)(0.0f, 0.0f, 0.0f);
float3 mask = (float3)(1.0f, 1.0f, 1.0f);
for (int bounces = 0; bounces < 8; bounces++){
float t; /* distance to intersection */
int hitsphere_id = 0; /* index of intersected sphere */
/* if ray misses scene, return background colour */
if (!intersect_scene(spheres, &ray, &t, &hitsphere_id, sphere_count))
return accum_color += mask * (float3)(0.15f, 0.15f, 0.25f);
/* else, we've got a hit! Fetch the closest hit sphere */
Sphere hitsphere = spheres[hitsphere_id]; /* version with local copy of sphere */
/* compute the hitpoint using the ray equation */
float3 hitpoint = ray.origin + ray.dir * t;
/* compute the surface normal and flip it if necessary to face the incoming ray */
float3 normal = normalize(hitpoint - hitsphere.pos);
float3 normal_facing = dot(normal, ray.dir) < 0.0f ? normal : normal * (-1.0f);
/* compute two random numbers to pick a random point on the hemisphere above the hitpoint*/
float rand1 = 2.0f * PI * get_random(seed0, seed1);
float rand2 = get_random(seed0, seed1);
float rand2s = sqrt(rand2);
/* create a local orthogonal coordinate frame centered at the hitpoint */
float3 w = normal_facing;
float3 axis = fabs(w.x) > 0.1f ? (float3)(0.0f, 1.0f, 0.0f) : (float3)(1.0f, 0.0f, 0.0f);
float3 u = normalize(cross(axis, w));
float3 v = cross(w, u);
/* use the coordinate frame and random numbers to compute the next ray direction */
float3 newdir = normalize(u * cos(rand1)*rand2s + v*sin(rand1)*rand2s + w*sqrt(1.0f - rand2));
/* add a very small offset to the hitpoint to prevent self intersection */
ray.origin = hitpoint + normal_facing * EPSILON;
ray.dir = newdir;
/* add the colour and light contributions to the accumulated colour */
accum_color += mask * hitsphere.emission;
/* the mask colour picks up surface colours at each bounce */
mask *= hitsphere.color;
/* perform cosine-weighted importance sampling for diffuse surfaces*/
mask *= dot(newdir, normal_facing);
}
return accum_color;
}
__kernel void render_kernel(__constant Sphere* spheres, const int width, const int height, const int sphere_count, __global float3* output){
unsigned int work_item_id = get_global_id(0); /* the unique global id of the work item for the current pixel */
unsigned int x_coord = work_item_id % width; /* x-coordinate of the pixel */
unsigned int y_coord = work_item_id / width; /* y-coordinate of the pixel */
/* seeds for random number generator */
unsigned int seed0 = x_coord;
unsigned int seed1 = y_coord;
Ray camray = createCamRay(x_coord, y_coord, width, height);
/* add the light contribution of each sample and average over all samples*/
float3 finalcolor = (float3)(0.0f, 0.0f, 0.0f);
float invSamples = 1.0f / SAMPLES;
for (int i = 0; i < SAMPLES; i++)
finalcolor += trace(spheres, &camray, sphere_count, &seed0, &seed1) * invSamples;
/* store the pixelcolour in the output buffer */
output[work_item_id] = finalcolor;
}
Since your program works correctly on the CPU but not on the GPU, you could be exceeding the GPU's TDR (Timeout Detection and Recovery) timer.
A common cause of the "Abort trap: 6" error when computing on the GPU is keeping the GPU locked in compute mode for too long (a commonly cited limit is around 5 seconds, but I found contradicting sources on this number). When that happens, the watchdog forcefully stops and restarts the graphics driver to prevent the screen from freezing.
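If you want to check how close a single launch gets to that limit, a minimal sketch (reusing the context, device, kernel and work sizes from your main.cpp, and assuming the cl.hpp wrapper) is to create a queue with profiling enabled and time the kernel with an event:
// sketch only: measure the device-side duration of one kernel launch
cl::CommandQueue profQueue(context, device, CL_QUEUE_PROFILING_ENABLE);
cl::Event evt;
profQueue.enqueueNDRangeKernel(kernel, cl::NullRange, global_work_size, local_work_size, NULL, &evt);
evt.wait();
cl_ulong t0 = evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong t1 = evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << "Kernel time: " << (t1 - t0) * 1e-9 << " s" << std::endl;
If that number is anywhere near a few seconds, the watchdog is the likely culprit.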
There are a couple of possible solutions to this problem:
- Most (if not all) operating systems do not enforce the TDR if no screen is attached to the GPU.
- If you are working on an Nvidia Tesla GPU, check whether it can be switched to Tesla Compute Cluster mode; in that mode the TDR limit is not enforced. There may be a similar mode for AMD GPUs, but I'm not sure.
- Under Windows, the timeout can be raised by setting the TdrDelay and TdrDdiDelay registry keys under HKEY_LOCAL_MACHINE -> SYSTEM -> CurrentControlSet -> Control -> GraphicsDrivers to a higher value. Be careful not to set the value too high, or you won't be able to tell whether the driver has really crashed. Also note that graphics driver or Windows updates may reset these keys to their defaults.
- Under Linux, the TDR should already be disabled by default (I know it is under Ubuntu 18 and CentOS 8, but I haven't tested other versions/distros). If you still have problems, you can add Option "Interactive" "0" to your Xorg config, as described in this SO question.
- Unfortunately, I don't know (and couldn't find) a way to do this on macOS, but I do know that the limit is not enforced on a secondary GPU if you have one installed in your system.
- If you can split your computation into smaller chunks, you may be able to stay under the TDR timer (e.g. two computations that take 4 s each instead of a single 8 s one). Whether this is feasible depends on your problem and may or may not be an easy task; a sketch of this approach for your renderer follows below.
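For your renderer specifically, a minimal sketch of this chunking (reusing the variables from your main.cpp, and assuming the driver accepts a non-zero global work offset, which requires OpenCL 1.1 or later) is to launch the same kernel several times over sub-ranges of the pixels and wait in between; get_global_id(0) already includes the offset, so neither the kernel nor the output indexing needs to change:
// sketch only: split the 1D range into chunks so no single launch runs too long
std::size_t chunk = 64 * local_work_size; // tune per device; keep it a multiple of local_work_size
for (std::size_t offset = 0; offset < global_work_size; offset += chunk){
    std::size_t count = (global_work_size - offset < chunk) ? global_work_size - offset : chunk;
    queue.enqueueNDRangeKernel(kernel, cl::NDRange(offset), cl::NDRange(count), cl::NDRange(local_work_size));
    queue.finish(); // wait for this chunk before issuing the next one
}
Another option in the same spirit is to lower SAMPLES in the kernel and accumulate several passes on the host, which also keeps each launch short.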