
Fastest way to convert 12bit image to 16bit image

Most modern CMOS cameras can produce 12bit Bayer-pattern images. What would be the fastest way to convert an array of 12bit image data to 16bit so that it can be processed? The actual problem is padding each 12bit number with 4 zero bits. Little-endian byte order can be assumed, and SSE2/SSE3/SSE4 are acceptable.

Code added:

int* imagePtr = (int*)Image.data;
fixed (float* imageData = img.Data)
{
    float* imagePointer = imageData;
    for (int t = 0; t < total; t++)
    {
        // Two overlapping 32-bit reads; the pointer advances 2 + 4 = 6 bytes per iteration.
        int i1 = *imagePtr;
        imagePtr = (int*)((ushort*)imagePtr + 1);
        int i2 = *imagePtr;
        imagePtr = (int*)((ushort*)imagePtr + 2);

        // Four float outputs are written per iteration.
        *imagePointer = (float)(((i1 << 4) & 0x00000FF0) | ((i1 >> 8) & 0x0000000F));
        imagePointer++;
        *imagePointer = (float)((i1 >> 12) & 0x00000FFF);
        imagePointer++;
        *imagePointer = (float)(((i2 >> 4) & 0x00000FF0) | ((i2 >> 12) & 0x0000000F));
        imagePointer++;
        *imagePointer = (float)((i2 >> 20) & 0x00000FFF);
        imagePointer++;
    }
}
Gilad asked Mar 15 '13 23:03


2 Answers

I cannot guarantee that this is the fastest, but here is an approach that uses SSE (the pblendw instruction used below requires SSE4.1). Each iteration converts eight 12bit values to 16bit and is made up of several steps, each of which handles roughly two of the conversions.

This approach straddles the 12bit integers around the 16bit boundaries in the xmm register. The diagram below shows how this is done, using the following notation.

  • One xmm register is being used (assume xmm0). The state of the register is represented by one line of letters.
  • Each letter represents 4 bits of a 12bit integer (ie, AAA is the entire first 12bit word in the array).
  • Each gap represents a 16-bit boundary.
  • >>2 indicates a logical right shift by two letters, that is, by one byte.
  • The caret (^) symbol marks the 12bit integers that sit directly against a 16bit boundary in each step.


load
AAAB BBCC CDDD EEEF FFGG GHHH JJJK KKLL
^^^

>>2
00AA ABBB CCCD DDEE EFFF GGGH HHJJ JKKK
      ^^^ ^^^    

>>2
0000 AAAB BBCC CDDD EEEF FFGG GHHH JJJK
                ^^^ ^^^    

>>2
0000 00AA ABBB CCCD DDEE EFFF GGGH HHJJ
                          ^^^ ^^^    

>>2
0000 0000 AAAB BBCC CDDD EEEF FFGG GHHH
                                    ^^^

At each step, we can extract the aligned 12bit integers and store them in the xmm1 register. At the end, our xmm1 will look as follows. Question marks denote values which we do not care about.

AAA? ?BBB CCC? ?DDD EEE? ?FFF GGG? ?HHH

Extract the high aligned integers (A, C, E, G) into xmm2 and then shift each word of xmm2 right logically by 4 bits. This converts the high aligned integers to low aligned. Blend these adjusted integers back into xmm1. The state of xmm1 is now:

?AAA ?BBB ?CCC ?DDD ?EEE ?FFF ?GGG ?HHH

Finally we can mask out the integers (ie, convert the ?'s to 0's) with 0FFFh on each word.

0AAA 0BBB 0CCC 0DDD 0EEE 0FFF 0GGG 0HHH

Now xmm1 contains eight consecutive converted integers.
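
As a sanity check of the steps above, here is a minimal scalar C sketch that models the register as eight 16-bit words. It is an illustration, not the SSE code itself: the reg128 type and the helper names are made up for this example, and w[7] is assumed to be the leftmost group in the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an xmm register: w[7] is the leftmost group in the
   diagram, w[0] the rightmost. */
typedef struct { uint16_t w[8]; } reg128;

/* Models psrldq by 1: shift the whole 128-bit value right by one byte. */
static reg128 shift_right_one_byte(reg128 r) {
    reg128 out;
    for (int k = 0; k < 8; k++) {
        uint16_t above = (k < 7) ? r.w[k + 1] : 0;   /* zero fill at the top */
        out.w[k] = (uint16_t)((r.w[k] >> 8) | (above << 8));
    }
    return out;
}

/* Models pblendw: take word k from src wherever bit k of mask is set. */
static reg128 blend_words(reg128 dst, reg128 src, unsigned mask) {
    assert(mask <= 0xFFu);
    for (int k = 0; k < 8; k++)
        if (mask & (1u << k)) dst.w[k] = src.w[k];
    return dst;
}

/* Unpacks the eight 12-bit values held in the top 96 bits of `in`. */
static reg128 unpack8(reg128 in) {
    static const unsigned masks[5] = { 0x80, 0x60, 0x18, 0x06, 0x01 };
    reg128 acc = {{0}};
    for (int step = 0; step < 5; step++) {
        acc = blend_words(acc, in, masks[step]);   /* grab the aligned values */
        in = shift_right_one_byte(in);             /* the ">>2" step */
    }
    for (int k = 0; k < 8; k++) {
        if (k & 1) acc.w[k] >>= 4;   /* odd words are high aligned */
        acc.w[k] &= 0x0FFF;          /* clear the "?" nibble */
    }
    return acc;
}
```

If the values 0x123, 0x456, 0x789, 0xABC, 0xDEF, 0x012, 0x345, 0x678 are packed back to back starting at w[7], unpack8 returns them one per word, in w[7] down to w[0].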

The following NASM program demonstrates this algorithm.

global main

segment .data
align 16                          ; movdqa and pand need 16-byte-aligned memory operands
sample dw 1234h, 5678h, 9ABCh, 1234h, 5678h, 9ABCh, 1234h, 5678h
low12 times 8 dw 0FFFh

segment .text
main:

  movdqa xmm0, [sample]

  pblendw xmm1, xmm0, 10000000b   ; take the aligned words, then shift one byte
  psrldq xmm0, 1
  pblendw xmm1, xmm0, 01100000b
  psrldq xmm0, 1
  pblendw xmm1, xmm0, 00011000b
  psrldq xmm0, 1
  pblendw xmm1, xmm0, 00000110b
  psrldq xmm0, 1
  pblendw xmm1, xmm0, 00000001b

  pblendw xmm2, xmm1, 10101010b   ; copy the high aligned words
  psrlw xmm2, 4                   ; make them low aligned

  pblendw xmm1, xmm2, 10101010b

  pand xmm1, [low12]        ; low12 could be stored in another xmm register

  ret                       ; xmm1 now holds the eight converted words
erisco answered Oct 22 '22 23:10


I'd try to build a solution around the SSSE3 instruction PSHUFB.

Given A=[a0, a1, a2, a3 ... a7], B=[b0, b1, b2, .. b7];

 PSHUFB(A,B) = [a_b0, a_b1, a_b2, ... a_b7],

except that the result byte will be zero, if the top bit of bX is 1.
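
To make these semantics concrete, a scalar C model of the byte shuffle might look like the following. The function name pshufb_model and the n parameter are invented for the sketch; n is 8 for the mmx form and 16 for the xmm form, and out is assumed not to overlap a.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PSHUFB on an n-byte register (n = 8 for mmx, 16 for xmm). */
static void pshufb_model(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    assert(n == 8 || n == 16);
    for (size_t i = 0; i < n; i++) {
        if (b[i] & 0x80)
            out[i] = 0;                    /* top bit set: result byte is zero */
        else
            out[i] = a[b[i] & (n - 1)];    /* otherwise pick byte a[b[i]] */
    }
}
```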

Thus, if A is the input vector

     A  = [aa ab bb cc cd dd ee ef]

then

C=PSHUFB(A, [0 1 1 2 3 4 4 5]) = [aa ab ab bb cc cd cd dd]
C=PSRLW (C, [4 0 4 0])         = [0a aa ab bb 0c cc cd dd] // >> 4 on alternating words (pseudocode: PSRLW itself applies one count to every word)
C=PSLLW (C, 4)                 = [aa a0 bb b0 cc c0 dd d0] // << by immediate
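
The two shift steps can be checked with a small scalar C sketch. It follows the notation above, where each pair of bytes is read as one 16-bit word with the first byte as the high half; that reading is an assumption about the notation, and the byte order inside a real little-endian register would need to be taken into account. The name align_words is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Applies the PSRLW [4 0 4 0 ...] and PSLLW 4 steps to the shuffled words.
   Words alternate between [aa ab] (shift needed) and [ab bb] (no shift). */
static void align_words(uint16_t *w, size_t n_words) {
    assert(n_words % 2 == 0);
    for (size_t i = 0; i < n_words; i++) {
        unsigned shift = (i % 2 == 0) ? 4u : 0u;
        w[i] = (uint16_t)(w[i] >> shift);   /* 0aaa or abbb */
        w[i] = (uint16_t)(w[i] << 4);       /* aaa0 or bbb0: the stray nibble falls off the top */
    }
}
```

The result is left aligned, so each value is multiplied by 16. If zero padding at the top is wanted, as in the question, masking with 0x0FFF after the right shift instead of shifting left would give 0aaa and 0bbb.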

A complete solution would read 3 or 6 mmx/xmm registers and write 4 or 8 mmx/xmm registers per round. The middle two outputs each have to be assembled from two input chunks, which requires some extra copying and combining of registers.
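
Whichever register layout is used, a plain scalar version is handy as a reference to test the SIMD output against. The sketch below assumes the nibble order used in the notation above (three bytes [aa ab bb] hold the values aaa and bbb); real cameras may pack differently, and the name unpack12_scalar is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: every 3 input bytes [aa ab bb] become two
   zero-padded 16-bit values 0x0aaa and 0x0bbb. dst must hold
   n_bytes / 3 * 2 elements. */
static void unpack12_scalar(const uint8_t *src, size_t n_bytes, uint16_t *dst) {
    assert(n_bytes % 3 == 0);
    for (size_t i = 0; i < n_bytes; i += 3) {
        *dst++ = (uint16_t)((src[i] << 4) | (src[i + 1] >> 4));
        *dst++ = (uint16_t)(((src[i + 1] & 0x0F) << 8) | src[i + 2]);
    }
}
```

Three 16-byte input registers (48 bytes) unpack to 32 values, which fill four 16-byte output registers, matching the 3-in, 4-out round described above.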

Aki Suihkonen answered Oct 22 '22 22:10