I am porting some code I wrote to NEON using inline assembly.
One of the things I need is to convert byte values in the range [0..127] to other byte values, taken from a table, which span the full range [0..255].
The table is short, but the math behind it is not easy, so I think it is not worth calculating the values "on the fly" each time. So I want to try lookup tables.
I have used VTBL for a 32-byte case, and it works as expected.
For the full range, one idea would be to first check which range the source value falls in and do a different lookup for each (i.e., having four 32-byte lookup tables).
My question is: Is there any more efficient way to do it?
EDIT
After some trials, I have done it with four lookups and, although the code is not yet scheduled, I am happy with the results. I leave the relevant inline assembly lines here, in case someone finds them useful or thinks they can be improved.
// Have the original data in d0
// d1 holds the value #32 in every lane
// d6,d7,d8,d9 hold the images for the values [0..31]
    //First we look for the 0..31 images. The values out of range will be 0
    "vtbl.u8 d2,{d6,d7,d8,d9},d0    \n\t"
    // Now we subtract #32 (held in d1) from d0 and look up the images for [32..63], which have been previously loaded in d10,d11,d12,d13
    "vsub.u8 d0,d0,d1\n\t"              
    "vtbl.u8 d3,{d10,d11,d12,d13},d0    \n\t"
    // Do the same and calculating images for [64..95]
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d4,{d14,d15,d16,d17},d0    \n\t"
    // Last step: images for [96..127]
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d5,{d18,d19,d20,d21},d0    \n\t"
    // Now we add them all. No need to saturate, since at most one of them is nonzero in each lane
    "vadd.u8 d2,d2,d3\n\t"
    "vadd.u8 d4,d4,d5\n\t"
    "vadd.u8 d2,d2,d4\n\t"   // Leave the result in d2
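To check the logic of the four-lookup sequence without NEON hardware, it can be modelled one lane at a time in plain C. This is only a rough sketch: the function names `vtbl32_lane` and `lookup128_vtbl` are made up for illustration, and the model assumes the 128-byte table is stored as four consecutive 32-byte banks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of VTBL with a 32-byte table:
   an index outside 0..31 produces 0. (Hypothetical helper name.) */
static uint8_t vtbl32_lane(const uint8_t *bank, uint8_t index)
{
    return index < 32 ? bank[index] : 0;
}

/* Model of the vtbl / vsub / vadd sequence for a single byte.
   `table` is assumed to hold 128 entries, i.e. four 32-byte banks. */
static uint8_t lookup128_vtbl(const uint8_t *table, uint8_t index)
{
    assert(table != NULL);
    uint8_t sum = 0;
    for (size_t bank = 0; bank < 4; ++bank) {
        /* At most one bank sees an in-range index, so a plain add is enough. */
        sum = (uint8_t)(sum + vtbl32_lane(table + 32 * bank, index));
        /* Byte subtraction wraps modulo 256, like vsub on 8-bit lanes. */
        index = (uint8_t)(index - 32);
    }
    return sum;
}
```

Calling `lookup128_vtbl(table, i)` for every `i` in [0..127] should return `table[i]`, and any index of 128 or more should return 0, because all four modelled lookups are then out of range.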
The proper sequence is to use vtbl for the first lookup and vtbx for all the following ones:
vtbl.8  d0, { d2,d3,d4,d5 }, d1       // first value
vsub.i8 d1, d1, d31                   // decrement index (d31 is assumed to hold 32 in every lane)
vtbx.8  d0, { d6,d7,d8,d9 }, d1       // all the subsequent values
vsub.i8 d1, d1, d31                   // decrement index
vtbx.8  d0, { d10,d11,d12,d13 }, d1   // d10..d13 = q5,q6
vsub.i8 d1, d1, d31
vtbx.8  d0, { d14,d15,d16,d17 }, d1   // d14..d17 = q7,q8
The difference between vtbl and vtbx is that vtbl zeroes an element of d0 when the corresponding index in d1 is >= 32, whereas vtbx leaves the original value in d0 intact. Thus there's no need for the trickery as in my comment and no need to merge the partial values.
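The same idea can be checked lane by lane in plain C. This is a minimal sketch rather than real NEON code: the names `vtbl_lane`, `vtbx_lane` and `lookup128_vtbx` are hypothetical, and the 128-byte table is assumed to be laid out as four consecutive 32-byte banks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One lane of VTBL: an out-of-range index yields 0. */
static uint8_t vtbl_lane(const uint8_t *bank, uint8_t index)
{
    return index < 32 ? bank[index] : 0;
}

/* One lane of VTBX: an out-of-range index keeps the old destination value. */
static uint8_t vtbx_lane(uint8_t dest, const uint8_t *bank, uint8_t index)
{
    return index < 32 ? bank[index] : dest;
}

/* Model of the vtbl + 3x (vsub, vtbx) chain for a single byte.
   `table` is assumed to hold 128 entries, i.e. four 32-byte banks. */
static uint8_t lookup128_vtbx(const uint8_t *table, uint8_t index)
{
    assert(table != NULL);
    uint8_t result = vtbl_lane(table, index);   /* first bank, zero if no hit */
    for (size_t bank = 1; bank < 4; ++bank) {
        index = (uint8_t)(index - 32);          /* wraps modulo 256, like vsub */
        result = vtbx_lane(result, table + 32 * bank, index);
    }
    return result;                              /* no merge step needed */
}
```

Because each vtbx only overwrites the lanes whose index is in range, the result accumulates in a single register and the three final additions disappear.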