Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32-bits before I can load them into the GPU.
I really don't care what I set the remaining 8 bits to, or where the incoming 24-bits are in that 32-bit value - I can fix all that in a pixel shader. But I need to do the conversion from 24-bit to 32-bit really quickly.
I'm not terribly familiar with SIMD SSE operations, but from my cursory glance it doesn't look like I can do the expansion using them, given my reads and writes aren't the same size. Any suggestions? Or am I stuck sequentially massaging this data set?
This feels so very silly - I'm using the pixel shaders for parallelism, but I have to do a sequential per-pixel operation before that. I must be missing something obvious...
The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits.
uint32_t *src = ...;
uint32_t *dst = ...;
for (int i=0; i<num_pixels; i+=4) {
uint32_t sa = src[0];
uint32_t sb = src[1];
uint32_t sc = src[2];
dst[i+0] = sa;
dst[i+1] = (sa>>24) | (sb<<8);
dst[i+2] = (sb>>16) | (sc<<16);
dst[i+3] = sc>>8;
src += 3;
}
Edit:
Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration. The source and destination pointers Must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128
with _mm_loadu_si128
and _mm_store_si128
with _mm_storeu_si128
, but this will be slower.
#include <emmintrin.h>
#include <tmmintrin.h>
__m128i *src = ...;
__m128i *dst = ...;
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
for (int i=0; i<num_pixels; i+=16) {
__m128i sa = _mm_load_si128(src);
__m128i sb = _mm_load_si128(src+1);
__m128i sc = _mm_load_si128(src+2);
__m128i val = _mm_shuffle_epi8(sa, mask);
_mm_store_si128(dst, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
_mm_store_si128(dst+1, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
_mm_store_si128(dst+2, val);
val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
_mm_store_si128(dst+3, val);
src += 3;
dst += 4;
}
SSSE3 (not to be confused with SSE3) will require a relatively new processor: Core 2 or newer, and I believe AMD doesn't support it yet. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With