Most modern CMOS camera can produce 12bit bayered images. What would be the fastest way to convert an image data array of 12bit to 16bit so processing would be possible? The actual problem is padding each 12bit number with 4 zeros, little endian can be assumed, SSE2/SSE3/SS4 also acceptable.
Code added:
int* imagePtr = (int*)Image.data;
fixed (float* imageData = img.Data)
{
float* imagePointer = imageData;
for (int t = 0; t < total; t++)
{
int i1 = *imagePtr;
imagePtr = (int*)((ushort*)imagePtr + 1);
int i2 = *imagePtr;
imagePtr = (int*)((ushort*)imagePtr + 2);
*imagePointer = (float)(((i1 << 4) & 0x00000FF0) | ((i1 >> 8) & 0x0000000F));
imagePointer++;
*imagePointer = (float)((i1 >> 12) & 0x00000FFF);
imagePointer++;
*imagePointer = (float)(((i2 >> 4) & 0x00000FF0) | ((i2 >> 12) & 0x0000000F));
imagePointer++;
*imagePointer = (float)((i2 >> 20) & 0x00000FFF);
imagePointer++;
}
}
I cannot guarantee fastest, but this is an approach that uses SSE. Eight 12-16bit conversions are done per iteration and two conversions (approx) are done per step (ie, each iteration takes multiple steps).
This approach straddles the 12bit integers around the 16bit boundaries in the xmm register. Below shows how this is done.
:
load
AAAB BBCC CDDD EEEF FFGG GHHH JJJK KKLL
^^^
>>2
00AA ABBB CCCD DDEE EFFF GGGH HHJJ JKKK
^^^ ^^^
>>2
0000 AAAB BBCC CDDD EEEF FFGG GHHH JJJK
^^^ ^^^
>>2
0000 00AA ABBB CCCD DDEE EFFF GGGH HHJJ
^^^ ^^^
>>2
0000 0000 AAAB BBCC CDDD EEEF FFGG GHHH
^^^
At each step, we can extract the aligned 12bit integers and store them in the xmm1 register. At the end, our xmm1 will look as follows. Question marks denote values which we do not care about.
AAA? ?BBB CCC? ?DDD EEE? ?FFF GGG? ?HHH
Extract the high aligned integers (A, C, E, G) into xmm2 and then, on xmm2, perform a right logical word shift of 4 bits. This will convert the high aligned integers to low aligned. Blend these adjusted integers back into xmm1. The state of xmm1 is now:
?AAA ?BBB ?CCC ?DDD ?EEE ?FFF ?GGG ?HHH
Finally we can mask out the integers (ie, convert the ?'s to 0's) with 0FFFh on each word.
0AAA 0BBB 0CCC 0DDD 0EEE 0FFF 0GGG 0HHH
Now xmm1 contains eight consecutive converted integers.
The following NASM program demonstrates this algorithm.
global main
segment .data
sample dw 1234, 5678, 9ABCh, 1234, 5678, 9ABCh, 1234, 5678
low12 times 8 dw 0FFFh
segment .text
main:
movdqa xmm0, [sample]
pblendw xmm1, xmm0, 10000000b
psrldq xmm0, 1
pblendw xmm1, xmm0, 01100000b
psrldq xmm0, 1
pblendw xmm1, xmm0, 00011000b
psrldq xmm0, 1
pblendw xmm1, xmm0, 00000110b
psrldq xmm0, 1
pblendw xmm1, xmm0, 00000001b
pblendw xmm2, xmm1, 10101010b
psrlw xmm2, 4
pblendw xmm1, xmm2, 10101010b
pand xmm1, [low12] ; low12 could be stored in another xmm register
I'd try to build a solution around the SSSE3 instruction PSHUFB
;
Given A=[a0, a1, a2, a3 ... a7], B=[b0, b1, b2, .. b7];
PSHUFB(A,B) = [a_b0, a_b1, a_b2, ... a_b7],
except that the result byte will be zero, if the top bit of bX is 1.
Thus, if
A = [aa ab bb cc cd dd ee ef] == input vector
C=PSHUFB(A, [0 1 1 2 3 4 4 5]) = [aa ab ab bb cc cd cd dd]
C=PSRLW (C, [4 0 4 0]) = [0a aa ab bb 0c cc cd dd] // (>> 4)
C=PSLLW (C, 4) = [aa a0 bb b0 cc c0 dd d0] // << by immediate
A complete solution would read in 3 or 6 mmx / xmm registers and output 4/8 mmx/xmm registers each round. The middle two outputs will have to be combined from two input chunks, requiring some extra copying and combining of registers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With