
RenderScript Bound Pointers vs. Allocations

Does RenderScript guarantee the memory layout or stride in global pointers bound from the Java layer?

I read somewhere that it is best to use rsGetElementAt / rsSetElementAt functions because the layout is not guaranteed.

But elsewhere I read that those should be avoided when targeting GPU optimizations, whereas bound pointers are OK.


In my particular case, I need the kernel to access the value of many surrounding pixels. So far, I have done quite well with float pointers bound from the Java layer.

Java:

script.set_width(inputWidth);
script.bind_input(inputAllocation);

RS:

int width;
float *input;

void root(const float *v_in, float *v_out, uint32_t x, uint32_t y) {
    int current = x + width * y;
    int above   = current - width;
    int below   = current + width;

    *v_out = input[above   - 1] + input[above  ] + input[above   + 1] +
             input[current - 1] + input[current] + input[current + 1] +
             input[below   - 1] + input[below  ] + input[below   + 1] ;
}

This is a trivial simplification of what I'm actually doing, just to illustrate with an example. In reality, I'm doing far more of these combinations, with multiple input images at the same time, so much so that simply pre-computing the positions for the "above" and "below" rows helps a great deal with the processing time.

As long as memory is guaranteed to be sequential and in the same order you'd normally expect, all is good, and so far I haven't had any problems on my test devices.


But if this memory layout is truly not guaranteed across all devices/processors, and the stride can actually vary, then my code would obviously break and I'd be forced to use rsGetElementAt, such as:

Java:

script.set_input(inputAllocation);

RS:

rs_allocation input;

void root(const float *v_in, float *v_out, uint32_t x, uint32_t y) {
    *v_out = rsGetElementAt_float(input, x - 1, y - 1) + rsGetElementAt_float(input, x, y - 1) + rsGetElementAt_float(input, x + 1, y - 1) + 
             rsGetElementAt_float(input, x - 1, y    ) + rsGetElementAt_float(input, x, y    ) + rsGetElementAt_float(input, x + 1, y    ) + 
             rsGetElementAt_float(input, x - 1, y + 1) + rsGetElementAt_float(input, x, y + 1) + rsGetElementAt_float(input, x + 1, y + 1) ;
}

The average execution time of the script using rsGetElementAt() (710 ms) is almost twice that of the kernel using input[] (390 ms). I'm guessing this is because each call must independently recompute the memory offset for the given x,y coordinates.

My script needs to run continuously, so I'm trying to get every possible bit of performance out of it, and it would be a real pity to ignore such a considerable speedup.


So I'm wondering if anyone could shed some light on this.

Are there really any cases under which bound pointers will not be fully sequential, and is there a way to force them to be?

Is rsGetElementAt() truly necessary in this case, or is it safe to keep using bound pointers relying on a pre-defined stride?

monoeci asked Oct 02 '22 23:10


1 Answer

Bound pointers are only guaranteed to be sequential for simple 1D allocations. Any type with more than one dimension should be accessed with the rsGetElementAt_* / rsSetElementAt_* functions.
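To make the risk concrete, here is a minimal sketch in plain C (not RenderScript) of what happens when a 2D buffer has padded rows. The `image2d` struct, its `stride` field, and both helper functions are hypothetical names invented for this illustration; how a given driver actually lays out a 2D allocation is not something this sketch claims to know.

```c
#include <stddef.h>

/* Hypothetical 2D float buffer. stride is the distance between the
 * starts of two consecutive rows, counted in floats. It may be larger
 * than width if rows are padded. */
typedef struct {
    const float *data;
    size_t width;
    size_t height;
    size_t stride;
} image2d;

/* Stride-aware lookup: correct whether or not rows are padded. */
float get_strided(const image2d *img, size_t x, size_t y) {
    return img->data[x + img->stride * y];
}

/* Packed assumption, equivalent to input[x + width * y] in the question.
 * Only correct when stride == width. */
float get_packed(const image2d *img, size_t x, size_t y) {
    return img->data[x + img->width * y];
}
```

With a 3-wide image stored at a stride of 4, `get_packed` for the first pixel of the second row lands on the padding element at the end of the first row, while `get_strided` returns the intended pixel. For a 1D allocation there is no row stride to get wrong, which is consistent with the guarantee being limited to that case.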

Comments on performance:

rsGetElementAt_float() will typically outperform rsGetElementAt() because it knows the type and can avoid the lookup for stride. This is true of all the typed get/set methods.

Which OS version are you testing on? 4.4 brought some major improvements to this type of code which should be able to pull the address calculations out of the loops for many cases.

The pointer-manipulation approach will force some GPU drivers to fall back to the safe path.

Some newer drivers (4.4.1) will use the hardware address calculation unit, removing the overhead completely.
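As a rough illustration of what "pulling the address calculations out" means, the sketch below hoists the three row base addresses once and then indexes within each row. It is plain C, not RenderScript, and is not a description of what the RenderScript compiler actually emits; the function name `sum3x3_hoisted` and the `stride` parameter (row pitch in floats) are assumptions made for the example.

```c
#include <stddef.h>

/* Sum of the 3x3 neighbourhood around (x, y).
 * stride is the row pitch in floats (hypothetical parameter).
 * The caller must ensure x and y are at least 1 and that x + 1 and
 * y + 1 stay inside the image, since no bounds checks are done here. */
float sum3x3_hoisted(const float *data, size_t stride, size_t x, size_t y) {
    /* Row base addresses are computed once instead of once per element. */
    const float *above   = data + (y - 1) * stride;
    const float *current = above + stride;
    const float *below   = current + stride;

    return above[x - 1]   + above[x]   + above[x + 1] +
           current[x - 1] + current[x] + current[x + 1] +
           below[x - 1]   + below[x]   + below[x + 1];
}
```

This is the same saving the question gets by pre-computing the "above" and "below" positions by hand, except that the row pitch is passed in rather than assumed to equal the width.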

R. Jason Sams answered Oct 20 '22 22:10