I'm wondering if there is any way to improve the performance of this compacting. The idea is to saturate values higher than 4095 and place each value every 12 bits in a new contiguous buffer, like this:
Concept:
Convert:
Input buffer: [0.0][0.1][0.2] ... [0.15] | [1.0][1.1][1.2] ... [1.15] | [2.0][2.1][2.2] ... [2.15] etc ...
to:
Output buffer: [0.0][0.1][0.2] ... [0.11] | [1.0][1.1][1.2] ... [1.11] | [2.0][2.1][2.2] ... [2.11] etc ...
The input and output buffers are defined as:
uint16_t input[76800] (its size is 153600 bytes)
uint24_t output[38400] (its size is 115200 bytes)
So I have reduced the data size by 1/4. This computation costs ~1 ms on a Cortex-A9 running at 792 MHz with 2 cores. I have to perform such "compression" because I transfer about 18 MB/s over Ethernet, which gives me huge overhead. I've tested various compression algorithms such as Snappy and LZ4, and none of them came close to the ~1 ms achieved with saturation and bit shifting.
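To make the intended byte layout concrete, here is a tiny scalar illustration (the values and the standalone program are purely illustrative, not part of the real code):
#include <stdint.h>
#include <stdio.h>

/* Two 12-bit values packed into three consecutive bytes, little-endian. */
int main(void)
{
    uint16_t v0 = 0x0ABC, v1 = 0x4321;                       /* v1 exceeds 4095 */
    uint16_t s0 = v0 > 0x0FFF ? 0x0FFF : v0;                 /* saturate to 12 bits */
    uint16_t s1 = v1 > 0x0FFF ? 0x0FFF : v1;
    uint32_t packed = (uint32_t)s0 | ((uint32_t)s1 << 12);   /* 0x00FFFABC */
    uint8_t bytes[3] = { packed & 0xFF, (packed >> 8) & 0xFF, (packed >> 16) & 0xFF };
    printf("%02X %02X %02X\n", bytes[0], bytes[1], bytes[2]); /* prints: BC FA FF */
    return 0;
}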
I've written the following code:
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    union {
        struct {
            uint32_t value0_24x1:24;
        };
        struct {
            uint32_t value0_12x1:12;
            uint32_t value1_12x1:12;
        };
        struct {
            uint32_t value0_8x1:8;
            uint32_t value1_8x1:8;
            uint32_t value2_8x1:8;
        };
    };
} uint24_t;
#pragma pack(pop)
static inline uint32_t __attribute__((always_inline)) saturate(uint32_t value)
{
    uint32_t result;
    asm volatile("usat %[result], %[saturate], %[value] \n\t"
                 : [result] "=r" (result)
                 : [value] "r" (value), [saturate] "I" (12)
                 :
                 );
    return result;
}

void __attribute__((noinline, used)) compact(const uint16_t *input, uint24_t *output, uint32_t elements)
{
#if 0
    /* More readable, but slower */
    for (uint32_t i = 0; i < elements; ++i) {
        output->value0_12x1 = saturate(*input++);
        (output++)->value1_12x1 = saturate(*input++);
    }
#else
    /* Alternative - less readable but faster */
    for (uint32_t i = 0; i < elements; ++i, input += 2)
        (output++)->value0_24x1 = saturate(*input) | ((uint32_t)saturate(*(input + 1))) << 12;
#endif
}
static uint16_t buffer_in[76800] = {0};
static uint24_t buffer_out[38400] = {0};

int main()
{
    /* Dividing by 2 because we process two input values per loop iteration inside compact() */
    compact(buffer_in, buffer_out, sizeof(buffer_in) / sizeof(buffer_in[0]) / 2);
    return 0;
}
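As an aside, on toolchains newer than GCC 4.6 that ship arm_acle.h, the same USAT instruction can be reached without inline assembly. A minimal sketch, assuming the ACLE __usat() intrinsic is available on the target compiler:
#include <stdint.h>
#include <arm_acle.h>

/* Equivalent of saturate() via the ACLE intrinsic; emits a USAT instruction. */
static inline uint32_t saturate_acle(uint32_t value)
{
    return __usat((int32_t)value, 12);   /* unsigned saturation to 12 bits */
}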
And the generated assembly of compact():
00008664 <compact>:
    8664: e92d4010  push  {r4, lr}
    8668: e3a03000  mov   r3, #0
    866c: ea00000c  b     86a4 <compact+0x40>
    8670: e1d040b0  ldrh  r4, [r0]
    8674: e6ec4014  usat  r4, #12, r4
    8678: e1d0c0b2  ldrh  ip, [r0, #2]
    867c: e6ecc01c  usat  ip, #12, ip
    8680: e184c60c  orr   ip, r4, ip, lsl #12
    8684: e2833001  add   r3, r3, #1
    8688: e2800004  add   r0, r0, #4
    868c: e5c1c000  strb  ip, [r1]
    8690: e7e7445c  ubfx  r4, ip, #8, #8
    8694: e7e7c85c  ubfx  ip, ip, #16, #8
    8698: e5c14001  strb  r4, [r1, #1]
    869c: e5c1c002  strb  ip, [r1, #2]
    86a0: e2811003  add   r1, r1, #3
    86a4: e1530002  cmp   r3, r2
    86a8: 1afffff0  bne   8670 <compact+0xc>
    86ac: e8bd8010  pop   {r4, pc}
Compiled using GCC 4.6.3 with the following CFLAGS:
-Os (-O2 and -O3 do not give any noticeable improvements)
-march=armv7-a -mcpu=cortex-a9 -mtune=cortex-a9
-marm -mfloat-abi=softfp -mfpu=neon -funsafe-math-optimizations
Benchmarking shows that we're using ~10.3 cycles per converted value (1 ms at 792 MHz is ~792,000 cycles, divided by 76,800 input values ≈ 10.3).
The questions are: can this packing be made significantly faster with NEON, and is it worth using NEON intrinsics or should I go straight to hand-written assembly?
Some code example would be very welcome, because I'm a complete beginner when it comes to NEON.
Here are the answers:
Yes, it will be blazingly fast.
You should avoid intrinsics at all costs. It isn't worth the effort. Go for assembly.
I'll give you a sample implementation once I arrive home.
////////////////////////////////////////////////////
OK, here it goes: you want to pack 16 bits down to 12 bits, a ratio of 4:3.
Therefore, it's wise to load the data 4-spread and store it 3-spread: vld4.16 -> vst3.16
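For readers who want to see the idea in C first, here is a rough intrinsics sketch of the same 4-spread load / 3-spread store scheme (the function name is a placeholder of mine; as noted above, the hand-written assembly below is still the recommended route):
#include <arm_neon.h>
#include <stdint.h>

/* Illustrative only: mirrors the shift/insert scheme of the assembly below.
 * count is the number of 16-bit input elements, assumed to be a multiple of 32. */
void pack16to12_intrin(uint16_t *pDst, const uint16_t *pSrc, unsigned count)
{
    for (unsigned i = 0; i + 32 <= count; i += 32, pSrc += 32, pDst += 24) {
        uint16x8x4_t v = vld4q_u16(pSrc);           /* de-interleave lanes a, b, c, d */
        uint16x8_t a = vqshlq_n_u16(v.val[0], 4);   /* saturate to 12 bits, payload in bits 15:4 */
        uint16x8_t b = vqshlq_n_u16(v.val[1], 4);
        uint16x8_t c = vqshlq_n_u16(v.val[2], 4);
        uint16x8_t d = vqshlq_n_u16(v.val[3], 4);

        uint16x8x3_t o;
        o.val[0] = vsriq_n_u16(a, b, 12);                 /* a[11:0] : b[11:8] */
        o.val[1] = vsriq_n_u16(vshlq_n_u16(b, 4), c, 8);  /* b[7:0]  : c[11:4] */
        o.val[2] = vsriq_n_u16(vshlq_n_u16(c, 8), d, 4);  /* c[3:0]  : d[11:0] */
        vst3q_u16(pDst, o);
    }
}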
/*
* void fanic_pack16to12(unsigned short * pDst, unsigned short * pSrc, unsigned int count);
* assert :
* count >= 64
* count % 4 == 0
*
* written by : Jake Lee
* part of FANIC project - Fastest ARM NEON Implementation Challenge
*/
pDst .req r0
pSrc .req r1
count .req r2
.text
.arm
.global fanic_pack16to12
.func
.align 5
fanic_pack16to12:
pld [pSrc]
pld [pSrc, #64]
pld [pSrc, #128]
pld [pSrc, #192]
pld [pSrc, #256]
sub count, count, #64
.align 5
1:
vld4.16 {d16, d18, d20, d22}, [pSrc]!
vld4.16 {d17, d19, d21, d23}, [pSrc]!
vld4.16 {d24, d26, d28, d30}, [pSrc]!
vld4.16 {d25, d27, d29, d31}, [pSrc]!
pld [pSrc, #128]
pld [pSrc, #192]
subs count, count, #64
vqshl.u16 q0, q8, #4
vqshl.u16 q3, q9, #4
vqshl.u16 q8, q10, #4
vqshl.u16 q9, q11, #4
vqshl.u16 q10, q12, #4
vqshl.u16 q13, q13, #4
vqshl.u16 q14, q14, #4
vqshl.u16 q15, q15, #4
vshl.u16 q1, q3, #4
vshl.u16 q2, q8, #8
vshl.u16 q11, q13, #4
vshl.u16 q12, q14, #8
vsri.16 q0, q3, #12
vsri.16 q1, q8, #8
vsri.16 q2, q9, #4
vsri.16 q10, q13, #12
vsri.16 q11, q14, #8
vsri.16 q12, q15, #4
vst3.16 {d0, d2, d4}, [pDst]!
vst3.16 {d1, d3, d5}, [pDst]!
vst3.16 {d20, d22, d24}, [pDst]!
vst3.16 {d21, d23, d25}, [pDst]!
bpl 1b
cmp count, #-64
add pDst, pDst, count
bxle lr
add pSrc, pSrc, count, lsl #1
add pDst, pDst, count, asr #1
b 1b
.endfunc
.end
Please note how many cycles and how much bandwidth are saved thanks to smart register allocation and loop control, practices that are simply impossible with intrinsics.
This implementation will run as fast as if it were done by dedicated hardware.
Have fun!
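For completeness, a minimal sketch of calling it from C (my addition; buffer sizes follow the question, and count is the number of 16-bit input elements as implied by the asserts above):
#include <stdint.h>

extern void fanic_pack16to12(unsigned short *pDst, unsigned short *pSrc,
                             unsigned int count);

static uint16_t buffer_in[76800];
static uint16_t buffer_out[57600];   /* 76800 values * 12 bits = 115200 bytes */

int main(void)
{
    fanic_pack16to12(buffer_out, buffer_in, 76800);
    return 0;
}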
//////////////////////////////////////////////////////
OK, below is the unpacking function:
/*
* void fanic_unpack12to16(unsigned short *pDst, unsigned short *pSrc, unsigned int count);
* assert :
* count >=64
* count % 4 == 0
*
* written by : Jake Lee
* part of FANIC project - Fastest ARM NEON Implementation Challenge
*/
pDst .req r0
pSrc .req r1
count .req r2
.text
.arm
.global fanic_unpack12to16
.func
.align 5
fanic_unpack12to16:
pld [pSrc]
pld [pSrc, #64*1]
pld [pSrc, #64*2]
vpush {q4}
pld [pSrc, #64*3]
vmov.i16 q4, #0x0fff
pld [pSrc, #64*4]
sub count, count, #64
.align 5
1:
vld3.16 {d20, d22, d24}, [pSrc]!
vld3.16 {d21, d23, d25}, [pSrc]!
vld3.16 {d26, d28, d30}, [pSrc]!
vld3.16 {d27, d29, d31}, [pSrc]!
pld [pSrc, #128]
pld [pSrc, #192]
subs count, count, #64
vshr.u16 q1, q11, #8
vshr.u16 q2, q12, #12
vshr.u16 q0, q10, #4
vand q3, q12, q4
vshr.u16 q9, q14, #8
vsli.16 q1, q10, #8
vsli.16 q2, q11, #4
vshr.u16 q10, q15, #12
vsli.16 q9, q13, #8
vbic.i16 q1, q1, #0xf000
vbic.i16 q2, q2, #0xf000
vsli.16 q10, q14, #4
vshr.u16 q8, q13, #4
vbic.i16 q9, q9, #0xf000
vand q11, q15, q4
vbic.i16 q10, q10, #0xf000
vst4.16 {d0, d2, d4, d6}, [pDst]!
vst4.16 {d1, d3, d5, d7}, [pDst]!
vst4.16 {d16, d18, d20, d22}, [pDst]!
vst4.16 {d17, d19, d21, d23}, [pDst]!
bpl 1b
cmp count, #-64
add pSrc, pSrc, count
vpople {q4}
bxle lr
add pSrc, pSrc, count, asr #1
add pDst, pDst, count, lsl #1
b 1b
.endfunc
.end
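A quick round-trip sanity check one might write (my own harness, not part of the answer; it assumes both routines above are assembled and linked in, and that count for the unpacker is the number of 16-bit output elements):
#include <stdint.h>
#include <stdio.h>

extern void fanic_pack16to12(unsigned short *pDst, unsigned short *pSrc, unsigned int count);
extern void fanic_unpack12to16(unsigned short *pDst, unsigned short *pSrc, unsigned int count);

static uint16_t in[76800], packed[57600], out[76800];

int main(void)
{
    for (unsigned i = 0; i < 76800; ++i)
        in[i] = (uint16_t)(i * 7);                    /* some values exceed 4095 */

    fanic_pack16to12(packed, in, 76800);              /* 76800 -> 57600 uint16_t */
    fanic_unpack12to16(out, packed, 76800);           /* back to 76800 uint16_t */

    for (unsigned i = 0; i < 76800; ++i) {
        uint16_t expect = in[i] > 0x0FFF ? 0x0FFF : in[i];
        if (out[i] != expect) {
            printf("mismatch at %u: got %04X, want %04X\n", i, out[i], expect);
            return 1;
        }
    }
    printf("round-trip OK\n");
    return 0;
}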
Tweak points:
Guarantee the alignment of all the memory-related accesses (which requires the buffers themselves to be suitably aligned): 256-bit for the 4-spread ones, 64-bit for the 3-spread ones, like the following:
vld4.16 {d16, d18, d20, d22}, [pSrc,:256]!
..
vst3.16 {d0, d2, d4}, [pDst,:64]!
..
Make count a multiple of 64. Otherwise, you'll have to write extra code dealing with the residual data (and with the alignment hints above, the current residual-handling path would crash due to an alignment fault).
This will improve the performance by a good margin, if not a huge one.
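For reference, a minimal sketch of buffer declarations that satisfy those hints (32-byte alignment covers the 256-bit hint; the attribute is GCC syntax, and the names follow the question):
#include <stdint.h>

/* Each loop iteration advances the pointers by 128 / 96 bytes, both multiples
 * of 32, so 32-byte alignment of the buffers is preserved throughout. */
static uint16_t buffer_in[76800]  __attribute__((aligned(32)));
static uint16_t buffer_out[57600] __attribute__((aligned(32)));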