Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to transpose 4x4 byte matrix

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from

A B C D
E F G H
I J K L
M N O P

to

A E I M
B F J N
C G K O
D H L P

We can assume that I have valid pointers pointing to A, E, I, and M in memory (such that reading 32-bits from A will get me the integer containing bytes ABCD).

This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix can fit into a 32-bit integer, and I'm looking for answers that can perform a transpose quickly using general purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.

like image 384
Mokosha Avatar asked Jul 18 '14 03:07

Mokosha


2 Answers

You want potability and efficiency. Well you can't have it both ways. You said you want to do this with the fewest number of instructions. Well it's possible to do this with only one instruction with SSE3 using the pshufb instruction (see below) from the x86 instruction set.

Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.

//use -msse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,15,16};
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);

    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]); printf("\n");
    //output: 0 4 8 12 1 5 9 13 2 6 10 15 3 7 11 16   
}
like image 123
Z boson Avatar answered Sep 17 '22 12:09

Z boson


Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:

void transpose(uint32_t const in[4], uint32_t out[4]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0] = in[0] & 0xFF000000U; // A . . .
  out[1] = in[1] & 0x00FF0000U; // . F . .
  out[2] = in[2] & 0x0000FF00U; // . . K .
  out[3] = in[3] & 0x000000FFU; // . . . P

  out[1] |= (in[0] <<  8) & 0xFF000000U; // B F . .
  out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
  out[3] |= (in[0] << 24);               // D . . P

  out[0] |= (in[1] >>  8) & 0x00FF0000U; // A E . .
  out[2] |= (in[1] <<  8) & 0x00FF0000U; // C G K .
  out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P

  out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
  out[1] |= (in[2] >>  8) & 0x0000FF00U; // B F J .
  out[3] |= (in[2] <<  8) & 0x0000FF00U; // D H L P

  out[0] |= (in[3] >> 24);               // A E I M
  out[1] |= (in[3] >>  8) & 0x000000FFU; // B F J N
  out[2] |= (in[3] <<  8) & 0x000000FFU; // C G K O
}

I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.

Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.

So far, the cost is 12 shifts, 12 ORs, 16 ANDs. If the compiler and platform are any good, it can be done in 9 32 bit registers.

If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help extol the fact that the shifts and masks are just byte extractions:

void transpose(uint8_t const in[16], uint8_t out[16]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0]  = in[0];  // A . . .
  out[1]  = in[4];  // A E . .
  out[2]  = in[8];  // A E I .
  out[3]  = in[12]; // A E I M
  out[4]  = in[1];  // B . . .
  out[5]  = in[5];  // B F . .
  out[6]  = in[9];  // B F J .
  out[7]  = in[13]; // B F J N
  out[8]  = in[2];  // C . . .
  out[9]  = in[6];  // C G . .
  out[10] = in[10]; // C G K .
  out[11] = in[14]; // C G K O
  out[12] = in[3];  // D . . .
  out[13] = in[7];  // D H . .
  out[14] = in[11]; // D H L .
  out[15] = in[15]; // D H L P
}

If you really want to shuffle it in-place, then the following would do.

void transpose(uint8_t m[16]) {
  std::swap(m[1], m[4]);
  std::swap(m[2], m[8]);
  std::swap(m[3], m[12]);
  std::swap(m[6], m[9]);
  std::swap(m[7], m[13]);
  std::swap(m[11], m[14]);
}

The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.

like image 26
Kuba hasn't forgotten Monica Avatar answered Sep 21 '22 12:09

Kuba hasn't forgotten Monica