Traversal of Bounding Volume Hierarchy in Shaders

I am working on a path tracer using Vulkan compute shaders. I implemented a tree representing a bounding volume hierarchy (BVH). The idea of the BVH is to minimize the number of objects that a ray intersection test needs to be performed on.
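For readers following along on the CPU: the snippets in this post call intersect_box without showing it. The following is a minimal Python sketch of a common ray/AABB "slab" test, not the shader's actual implementation. The argument layout (box_min, box_max, origin, direction as 3-element sequences) is an assumption made for illustration.

```python
def intersect_box(box_min, box_max, origin, direction):
    """Slab test: True if the ray (origin + t * direction, t >= 0) hits the box.

    The argument layout is hypothetical; the shader's intersect_box takes
    a box struct and a ray struct whose fields are not shown in the post.
    """
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        if direction[axis] == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if not (box_min[axis] <= origin[axis] <= box_max[axis]):
                return False
            continue
        t0 = (box_min[axis] - origin[axis]) / direction[axis]
        t1 = (box_max[axis] - origin[axis]) / direction[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval of t values that lie inside all slabs so far.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


unit_min, unit_max = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
print(intersect_box(unit_min, unit_max, (-1.0, 0.5, 0.5), (1.0, 0.0, 0.0)))  # True
print(intersect_box(unit_min, unit_max, (-1.0, 2.0, 0.5), (1.0, 0.0, 0.0)))  # False
```

Each BVH node costs one such test, which is why skipping whole subtrees pays off.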

#1 Naive Implementation

My first implementation is very fast: it traverses the tree down to a single leaf of the BVH. However, the ray might intersect multiple leaves, so this code leads to some triangles not being rendered even though they should be.

int box_index = -1;

for (int i = 0; i < boxes_count; i++) {
    // the first box has no parent, boxes[0].parent is set to -1
    if (boxes[i].parent == box_index) {
        if (intersect_box(boxes[i], ray)) {
            box_index = i;
        }
    }
}

if (box_index > -1) {
    uint a = boxes[box_index].ids_offset;
    uint b = a + boxes[box_index].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}
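
To see why this misses triangles, here is a small CPU-side sketch of the same loop in Python. The box intersection results are precomputed as a list of booleans. The three-box tree (`parents`, `hits`) is made-up test data, not taken from the actual scene.

```python
def descend_single_path(parents, hits):
    """Mirror of the shader loop: follow the first intersected child per level.

    parents[i] is the parent index of box i (-1 for the root);
    hits[i] stands in for intersect_box(boxes[i], ray).
    """
    box_index = -1
    for i, parent in enumerate(parents):
        if parent == box_index and hits[i]:
            box_index = i
    return box_index


# Root (0) with two leaf children (1 and 2); the ray intersects all three boxes.
parents = [-1, 0, 0]
hits = [True, True, True]
print(descend_single_path(parents, hits))  # prints 1
```

Once box 1 is selected, box 2 no longer matches `parent == box_index`, so its triangles are never tested although the ray hits it.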

#2 Multi-Leaf Implementation

My second implementation accounts for the fact that multiple leaves might be intersected. However, this implementation is 36x slower than implementation #1 (okay, I miss some intersection tests in #1, but still...).

bool[boxes.length()] hits;
hits[0] = intersect_box(boxes[0], ray);

for (int i = 1; i < boxes_count; i++) {
    if (hits[boxes[i].parent]) {
        hits[i] = intersect_box(boxes[i], ray);
    } else {
        hits[i] = false;
    }
}

for (int i = 0; i < boxes_count; i++) {
    if (!hits[i]) {
        continue;
    }

    // only leaves have ids_offset and ids_count defined (not set to -1)
    if (boxes[i].ids_offset < 0) {
        continue;
    }

    uint a = boxes[i].ids_offset;
    uint b = a + boxes[i].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}
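
For checking the logic outside the shader, the two loops above can be mirrored on the CPU. This is only a sketch: `parents`, `is_leaf` and `box_hit` are hypothetical stand-ins for the box buffer and for intersect_box, and it assumes (as the shader code does) that every parent is stored before its children.

```python
def collect_hit_leaves(parents, is_leaf, box_hit):
    """Return the indices of all leaves whose whole ancestor chain is hit.

    Assumes parents appear before their children in the arrays.
    """
    hits = [False] * len(parents)
    leaves = []
    for i, parent in enumerate(parents):
        # A box is tested only if it is the root or its parent was hit.
        hits[i] = (parent < 0 or hits[parent]) and box_hit[i]
        if hits[i] and is_leaf[i]:
            leaves.append(i)
    return leaves


# Root (0) with two leaf children (1 and 2); the ray intersects all three boxes.
parents = [-1, 0, 0]
is_leaf = [False, True, True]
print(collect_hit_leaves(parents, is_leaf, [True, True, True]))  # prints [1, 2]
```

Unlike a single-path descent, both leaves are reported here, at the cost of keeping one flag per box.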

This performance difference drives me crazy. It seems that having even a single statement like if(dynamically_modified_array[some_index]) has a huge impact on performance. I suspect that the SPIR-V or GPU compiler is no longer able to do its optimization magic. So here are my questions:

Is this indeed an optimization problem?
If yes, can I transform implementation #2 to be better optimizable? Can I somehow give optimization hints?
Is there a standard way to implement BVH tree queries in shaders?

asked Apr 02 '19 16:04

jns

1 Answer

After some digging, I found a solution. It is important to understand that a BVH does not rule out the possibility that all leaves need to be evaluated.

Implementation #3 below uses hit and miss links. The boxes need to be sorted so that, in the worst case, all of them are queried in the correct order (so a single loop is enough). The links are then used to skip nodes which don't need to be evaluated. When the current node is a leaf node, the actual triangle intersections are performed.

hit link ~ which node to jump to in case of a hit (green below)
miss link ~ which node to jump to in case of a miss (red below)

[Figure: BVH tree evaluation order]

Image taken from here. The associated paper and source code is also on Prof. Toshiya Hachisuka's page. The same concept is also described in this paper referenced in the slides.

#3 BVH Tree with Hit and Miss Links

I had to extend the data which is pushed to the shader with the links. Also, some offline preprocessing was required to store the tree in the correct order. At first I tried using a while loop (looping until box_index_next is -1), which resulted in a crazy slowdown again. Anyway, the following works reasonably fast:

int box_index_next = 0;

for (int box_index = 0; box_index < boxes_count; box_index++) {
    if (box_index != box_index_next) {
        continue;
    }

    bool hit = intersect_box(boxes[box_index], ray);
    bool leaf = boxes[box_index].ids_count > 0;

    if (hit) {
        box_index_next = boxes[box_index].links.x; // hit link
    } else {
        box_index_next = boxes[box_index].links.y; // miss link
    }

    if (hit && leaf) {
        uint a = boxes[box_index].ids_offset;
        uint b = a + boxes[box_index].ids_count;

        for (uint j = a; j < b; j++) {
            uint triangle_id = triangle_references[j];
            // triangle intersection code ...
        }
    }
}
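
The link logic itself is easy to check on the CPU. Below is a minimal Python sketch that follows the links with a plain while loop until it reaches -1 (fine on the CPU, even though it was slow in my shader). The five-node tree, `links`, `is_leaf` and `box_hit` are made-up test data. `box_hit[i]` stands in for intersect_box.

```python
def traverse_links(links, is_leaf, box_hit):
    """Follow hit/miss links from node 0 until the link is -1.

    links[i] is a (hit_link, miss_link) pair of node indices.
    Returns the hit leaves in visiting order and the number of box tests.
    """
    visited_leaves = []
    tested = 0
    index = 0
    while index != -1:
        hit = box_hit[index]
        tested += 1
        if hit and is_leaf[index]:
            visited_leaves.append(index)  # triangle tests would go here
        hit_link, miss_link = links[index]
        index = hit_link if hit else miss_link
    return visited_leaves, tested


# Depth-first layout: 0 = root, 1 = inner node with leaves 2 and 3, 4 = leaf.
links = [(1, -1), (2, 4), (3, 3), (4, 4), (-1, -1)]
is_leaf = [False, False, True, True, True]

# The ray misses inner node 1, so leaves 2 and 3 are skipped via the miss link.
print(traverse_links(links, is_leaf, [True, False, True, True, True]))  # prints ([4], 3)
```

In the worst case (every box is hit) all five nodes are tested, which matches the single loop over all boxes in the shader.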

This code is about 3x slower than the fast but flawed implementation #1. This is somewhat expected: the speed now depends on the actual tree, not on GPU optimization. Consider, for example, a degenerate case where triangles are aligned along an axis: a ray in the same direction might intersect all triangles, so all tree leaves need to be evaluated.

Prof. Toshiya Hachisuka proposes a further optimization for such cases in his slides (page 36 and onward): one stores multiple versions of the BVH tree, spatially sorted along x, -x, y, -y, z and -z. For traversal, the correct version needs to be selected based on the ray. One can then stop the traversal as soon as a triangle from a leaf is intersected, since all remaining nodes to be visited will be spatially behind this node (from the ray's point of view).
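
I have not implemented this, but selecting the version could look roughly like the sketch below. It assumes that the version is picked by the dominant axis of the ray direction and its sign, which is my reading and not necessarily the exact rule from the slides. The function name and the index order (+x, -x, +y, -y, +z, -z) are hypothetical.

```python
def select_tree_version(direction):
    """Return an index 0..5 for the orderings +x, -x, +y, -y, +z, -z.

    Assumption: the ordering is chosen by the largest absolute component
    of the ray direction (ties go to the first such axis).
    """
    axis = max(range(3), key=lambda k: abs(direction[k]))
    return 2 * axis + (1 if direction[axis] < 0 else 0)


print(select_tree_version((0.1, -0.9, 0.2)))  # prints 3, i.e. the -y ordering
```

The price is six copies of the node array (or at least six sets of links) in GPU memory.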

Once the BVH tree is built, finding the links is quite straightforward (some Python code below):

class NodeAABB(object):

    def __init__(self, obj_bounds, obj_ids):
        self.children = [None, None]
        self.obj_bounds = obj_bounds
        self.obj_ids = obj_ids

    def split(self):
        # split recursively and create children here
        raise NotImplementedError()

    def is_leaf(self):
        return set(self.children) == {None}

    def build_links(self, next_right_node=None):
        if not self.is_leaf():
            child1, child2 = self.children

            self.hit_node = child1
            self.miss_node = next_right_node

            child1.build_links(next_right_node=child2)
            child2.build_links(next_right_node=next_right_node)

        else:
            self.hit_node = next_right_node
            self.miss_node = self.hit_node

    def collect(self):
        # retrieve in depth first fashion for correct order
        yield self
        if not self.is_leaf():
            child1, child2 = self.children
            yield from child1.collect()
            yield from child2.collect()

After you store all AABBs in an array (which will be sent to the GPU) you can use hit_node and miss_node to look up the indices for the links and store them as well.
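
That last step can be sketched as follows. The helper name `flatten_links` is made up, and the `SimpleNamespace` objects are only stand-ins for nodes that already carry `hit_node` and `miss_node` (in the real code the list would come from `list(root.collect())` after `root.build_links()`). A missing link is encoded as -1.

```python
from types import SimpleNamespace


def flatten_links(nodes):
    """Turn hit_node/miss_node references into (hit_index, miss_index) pairs.

    nodes must be in the depth-first order in which they are uploaded.
    A link of None (end of traversal) becomes -1.
    """
    index_of = {id(node): i for i, node in enumerate(nodes)}

    def to_index(node):
        return -1 if node is None else index_of[id(node)]

    return [(to_index(n.hit_node), to_index(n.miss_node)) for n in nodes]


# Stand-in tree: a root with two leaves, links set as build_links would set them.
root, left, right = (SimpleNamespace(hit_node=None, miss_node=None) for _ in range(3))
root.hit_node, root.miss_node = left, None
left.hit_node = left.miss_node = right
right.hit_node = right.miss_node = None

print(flatten_links([root, left, right]))  # prints [(1, -1), (2, 2), (-1, -1)]
```

Each pair corresponds to `links.x` and `links.y` of one box in the shader.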

answered Oct 23 '22 19:10

jns

Tags: compiler-optimization, glsl, bounding-box, raytracing, vulkan
