ARM NEON: How to implement a 256bytes Look Up table

Question

I am porting some code I wrote to NEON using inline assembly.

One of the things I need is to convert byte values in the range [0..128] to other byte values, taken from a table, which span the full range [0..255].

The table is short, but the math behind it is not easy, so I don't think it is worth calculating each value "on the fly". I want to try lookup tables instead.

I have used VTBL for a 32-byte case, and it works as expected.

For the full range, one idea would be to first check which range the source value falls in and do a different lookup for each (i.e., having four 32-byte lookup tables).

My question is: is there a more efficient way to do it?

EDIT

After some trials, I have done it with four lookups and, although the code is still not scheduled, I am happy with the results. I leave the relevant lines of inline assembly here in case someone finds them useful or thinks they can be improved.

// The original data is in d0
// d1 holds the value 32 in every lane
// d6,d7,d8,d9 hold the images for the values [0..31]

    // First we look up the images for [0..31]. Out-of-range values give 0
    "vtbl.u8 d2,{d6,d7,d8,d9},d0 \n\t"

    // Now we subtract 32 (d1) from d0 and find the images for [32..63], which have been previously loaded in d10,d11,d12,d13
    "vsub.u8 d0,d0,d1 \n\t"
    "vtbl.u8 d3,{d10,d11,d12,d13},d0 \n\t"

    // Do the same to calculate the images for [64..95]
    "vsub.u8 d0,d0,d1 \n\t"
    "vtbl.u8 d4,{d14,d15,d16,d17},d0 \n\t"

    // Last step: images for [96..127]
    "vsub.u8 d0,d0,d1 \n\t"
    "vtbl.u8 d5,{d18,d19,d20,d21},d0 \n\t"

    // Now we add them all. No need to saturate, since at most one will be nonzero in each lane
    "vadd.u8 d2,d2,d3 \n\t"
    "vadd.u8 d4,d4,d5 \n\t"
    "vadd.u8 d2,d2,d4 \n\t"   // Leave the result in d2
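
As a sanity check of the four-lookup scheme, here is a minimal scalar C sketch of what the sequence computes for one 8-byte D register. It is not NEON code: `vtbl32_model` and `lut128_vtbl_add` are hypothetical helper names, and the model only assumes the documented VTBL behaviour that a lane whose index is outside the table (here, >= 32) becomes 0.

```c
#include <stdint.h>

/* Scalar model of VTBL with a four-register (32-byte) table:
   a lane whose index is outside 0..31 is set to 0. */
void vtbl32_model(uint8_t dst[8], const uint8_t table[32], const uint8_t idx[8])
{
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = (idx[lane] < 32) ? table[idx[lane]] : 0;
}

/* Four 32-byte lookups over a 128-byte table, merged by addition. */
void lut128_vtbl_add(uint8_t out[8], const uint8_t table[128], const uint8_t in[8])
{
    uint8_t idx[8], part[8];

    for (int lane = 0; lane < 8; ++lane) {
        idx[lane] = in[lane];
        out[lane] = 0;
    }

    for (int chunk = 0; chunk < 4; ++chunk) {
        vtbl32_model(part, table + 32 * chunk, idx);
        for (int lane = 0; lane < 8; ++lane) {
            /* models vadd.u8: at most one partial result is nonzero per lane */
            out[lane] = (uint8_t)(out[lane] + part[lane]);
            /* models vsub.u8: the index wraps modulo 256 */
            idx[lane] = (uint8_t)(idx[lane] - 32);
        }
    }
}
```

Because the subtraction wraps modulo 256, an index lands in 0..31 for at most one of the four lookups, which is why a plain add is enough to merge the partial results. Inputs of 128 or more miss all four tables and come out as 0.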

Aki Suihkonen · Accepted Answer

The proper sequence uses vtbl for the first lookup and vtbx for all the following ones:

vtbl d0, { d2,d3,d4,d5 }, d1   // first value
vsub d1, d1, d31               // decrement index (d31 is assumed to hold 32 in every lane)
vtbx d0, { d6,d7,d8,d9 }, d1   // all the subsequent values
vsub d1, d1, d31               // decrement index
vtbx d0, { q5,q6 }, d1         // q5 = d10,d11 and q6 = d12,d13
vsub d1, d1, d31
vtbx d0, { q7,q8 }, d1         // q7 = d14,d15 and q8 = d16,d17

The difference between vtbl and vtbx is that vtbl zeroes an element of d0 when the corresponding index in d1 is >= 32, whereas vtbx leaves the original value in d0 intact. Thus there's no need for the trickery mentioned in my comment, and no need to merge the partial values.
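
The same chain can be sketched in C. This is only an illustration under stated assumptions, not the code from the answer: it swaps the assembly for the `arm_neon.h` intrinsics `vtbl4_u8` and `vtbx4_u8` when the compiler targets NEON, and otherwise falls back to a scalar model of the same semantics. The names `lut128_vtbx` and `load_table32` are hypothetical, and the table is assumed to be 128 bytes.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: load 32 consecutive table bytes into four D registers. */
static uint8x8x4_t load_table32(const uint8_t *p)
{
    uint8x8x4_t t;
    t.val[0] = vld1_u8(p);
    t.val[1] = vld1_u8(p + 8);
    t.val[2] = vld1_u8(p + 16);
    t.val[3] = vld1_u8(p + 24);
    return t;
}

void lut128_vtbx(uint8_t out[8], const uint8_t table[128], const uint8_t in[8])
{
    const uint8x8_t step = vdup_n_u8(32);
    uint8x8_t idx = vld1_u8(in);

    /* First lookup: VTBL zeroes lanes whose index is >= 32. */
    uint8x8_t res = vtbl4_u8(load_table32(table), idx);

    /* Remaining lookups: VTBX keeps lanes whose index is out of range. */
    for (int chunk = 1; chunk < 4; ++chunk) {
        idx = vsub_u8(idx, step);   /* wraps modulo 256 */
        res = vtbx4_u8(res, load_table32(table + 32 * chunk), idx);
    }
    vst1_u8(out, res);
}

#else

/* Scalar fallback modelling the same VTBL/VTBX chain lane by lane. */
void lut128_vtbx(uint8_t out[8], const uint8_t table[128], const uint8_t in[8])
{
    for (int lane = 0; lane < 8; ++lane) {
        uint8_t idx = in[lane];
        out[lane] = 0;   /* what the initial VTBL leaves for an index >= 32 */
        for (int chunk = 0; chunk < 4; ++chunk) {
            if (idx < 32)   /* VTBX writes only when the index is in range */
                out[lane] = table[32 * chunk + idx];
            idx = (uint8_t)(idx - 32);
        }
    }
}

#endif
```

In a real loop the four table loads would normally be hoisted out so the tables stay in registers across iterations, as in the assembly above; they are reloaded here only to keep the sketch short.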

Tags: optimization, assembly, arm, neon

Jordi C.
