How to speed up planar to packed/interleaved graphics in C++?

I'm programming an Arduino Due to drive an LED matrix with PWM. I need to prepare the data before drawing each line, but the innermost loop of that process is too slow, so the screen currently flickers. The loop should finish in under 500 µs. The Arduino Due has an 84 MHz ARM Cortex-M3 processor.

This is the concept of how I need to reassemble the bits for output:

5-bit color data:

R1=12, G1=4, B1=7, R2=0, G2=2, B2=27

The next step is to turn each value into a 32-bit word with a run of consecutive 1s in the low bits. The number of 1s equals the color value:

r1 = 0b00000000000000000000111111111111
g1 = 0b00000000000000000000000000001111
b1 = 0b00000000000000000000000001111111
r2 = 0b00000000000000000000000000000000
g2 = 0b00000000000000000000000000000011
b2 = 0b00000111111111111111111111111111
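As a minimal sketch of this expansion step (the helper name `thermometer` is hypothetical and not part of my actual code, which works on inverted values with `0x7FFFFFFF >> value` instead), a run of 1s can be produced with a single shift:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns a word whose lowest `value` bits are set.
inline std::uint32_t thermometer(unsigned value) {
    assert(value <= 32);
    // Shifting a 32-bit word by 32 is undefined, so zero is handled separately.
    return value == 0 ? 0u : 0xFFFFFFFFu >> (32u - value);
}
```

With the values above, `thermometer(12)` gives the `r1` pattern and `thermometer(27)` gives the `b2` pattern.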

The last step is to gather the n-th bit of each of the 30 color values (10 pixels) into the n-th 32-bit integer. Only the six values above are shown here:

pack1 = 0b00 ... 111011
pack2 = 0b00 ... 111011
pack3 = 0b00 ... 111001
pack4 = 0b00 ... 111001
pack5 = 0b00 ... 101001
...
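In plain terms, the packing can be sketched like this. The name `pack_bit` is hypothetical, and the bit order (first value in the highest used bit) is an assumption chosen to match the `pack` values listed above:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: collects bit `n` (0..31) of every stream into one word.
// streams[0] ends up in the highest used bit, streams[N-1] in bit 0.
template <std::size_t N>
std::uint32_t pack_bit(const std::uint32_t (&streams)[N], unsigned n) {
    static_assert(N <= 32, "at most 32 streams fit into one 32-bit word");
    std::uint32_t packed = 0;
    for (std::size_t k = 0; k < N; ++k) {
        packed = (packed << 1) | ((streams[k] >> n) & 1u);
    }
    return packed;
}
```

Feeding it the six bit streams from the example and n = 0 yields `0b111011`, and n = 4 yields `0b101001`.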

This is the code:

  // In my case scanwidth is 64*2 (64 is the width of the LED matrix and two lines are scanned at once)
  for ( i=0; i<scanwidth/5; i++) { // each run uses 5 upper and 5 lower pixels
      data = *lineptr++; // each int in the line buffer contains 2*15-bit inverted color data (red = 31-red etc.)
      p1uR = 0x7FFFFFFF >> (data >> 26); // pixel 1 of upper line red channel
      p1uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
      p1uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
      p1lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
      p1lG = 0x7FFFFFFF >> (data >> 5  & 0b11111);
      p1lB = 0x7FFFFFFF >> (data  & 0b11111);
      data = *lineptr++;
      p2uR = 0x7FFFFFFF >> (data >> 26);
      p2uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
      p2uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
      p2lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
      p2lG = 0x7FFFFFFF >> (data >> 5  & 0b11111);
      p2lB = 0x7FFFFFFF >> (data  & 0b11111);
      data = *lineptr++;
      p3uR = 0x7FFFFFFF >> (data >> 26);
      p3uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
      p3uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
      p3lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
      p3lG = 0x7FFFFFFF >> (data >> 5  & 0b11111);
      p3lB = 0x7FFFFFFF >> (data  & 0b11111);
      data = *lineptr++;
      p4uR = 0x7FFFFFFF >> (data >> 26);
      p4uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
      p4uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
      p4lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
      p4lG = 0x7FFFFFFF >> (data >> 5  & 0b11111);
      p4lB = 0x7FFFFFFF >> (data  & 0b11111);
      data = *lineptr++;
      p5uR = 0x7FFFFFFF >> (data >> 26);
      p5uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
      p5uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
      p5lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
      p5lG = 0x7FFFFFFF >> (data >> 5  & 0b11111);
      p5lB = 0x7FFFFFFF >> (data  & 0b11111);

      index = i;
      for (j=0; j<31; j++){ // loop over all 31 bits of the 0x7FFFFFFF mask
          index += (scanwidth/5+1);
          scanbuff[index] = (p5uR>>j&1)<<29 | (p5uG>>j&1)<<28 | (p5uB>>j&1)<<27 | (p5lR>>j&1)<<26 | (p5lG>>j&1)<<25 | (p5lB>>j&1)<<24 
                          | (p4uR>>j&1)<<23 | (p4uG>>j&1)<<22 | (p4uB>>j&1)<<21 | (p4lR>>j&1)<<20 | (p4lG>>j&1)<<19 | (p4lB>>j&1)<<18 
                          | (p3uR>>j&1)<<17 | (p3uG>>j&1)<<16 | (p3uB>>j&1)<<15 | (p3lR>>j&1)<<14 | (p3lG>>j&1)<<13 | (p3lB>>j&1)<<12 
                          | (p2uR>>j&1)<<11 | (p2uG>>j&1)<<10 | (p2uB>>j&1)<<9  | (p2lR>>j&1)<<8  | (p2lG>>j&1)<<7  | (p2lB>>j&1)<<6 
                          | (p1uR>>j&1)<<5  | (p1uG>>j&1)<<4  | (p1uB>>j&1)<<3  | (p1lR>>j&1)<<2  | (p1lG>>j&1)<<1  | (p1lB>>j&1);
         }
     }

I don't think it's necessary to improve the outer loop. I did try to unroll the inner loop, but performance didn't improve noticeably.

The Cortex-M3 can do shifts and logic operations in one clock cycle. I estimate that the outer and inner loops together take around 51,000 clock cycles (about 600 µs).
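To put the estimate next to the target: at 84 MHz, the 500 µs limit is a budget of 42,000 cycles, while 51,000 cycles come to roughly 607 µs. A small sketch of that arithmetic (both helper names are hypothetical):

```cpp
#include <cstdint>

// Hypothetical helpers for the timing budget; 84 MHz is the clock stated above.
constexpr std::uint32_t us_to_cycles(std::uint32_t microseconds, std::uint32_t clock_mhz) {
    return microseconds * clock_mhz; // 1 MHz means 1 cycle per microsecond
}

constexpr double cycles_to_us(std::uint32_t cycles, std::uint32_t clock_mhz) {
    return static_cast<double>(cycles) / clock_mhz;
}
```

So the loop has to shed about 9,000 cycles to fit the budget.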

Is there anything I can improve with standard C++ code? Are there any improvements that can be made with inline assembly?

asked Jul 25 '17 18:07 by uzumaki



1 Answer

Time for some Cortex-M3 black magic.

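#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility> // std::swap

// Start of the Cortex-M3 SRAM bit-band region, used here as a 128-byte scratch buffer.
// This assumes nothing else lives at that address; reserve it in the linker script if needed.
volatile char *const bitband_packed = (volatile char*)0x20000000;
// Bit-band alias region: each 32-bit word here maps to a single bit of the region above.
volatile uint32_t *const bitband_exploded = (volatile uint32_t*)0x22000000;

// Transposes a 32x32 bit matrix in place (bit j of word i <-> bit i of word j).
static inline void transform_32_32(uint32_t buff[32]) {
    const std::size_t size = sizeof(buff[0])*32;
    volatile char *const tmp = bitband_packed;
    std::memcpy(const_cast<char*>(tmp), buff, size);
    for(std::size_t i = 0; i < 32; i++) {
        for(std::size_t j = i + 1; j < 32; j++) {
            // Alias word (32 * i + j) is bit j of word i.
            std::swap(bitband_exploded[(32 * i + j)], bitband_exploded[(32 * j + i)]);
        }
    }
    std::memcpy(buff, const_cast<char*>(tmp), size);
}

// Expands 32 5-bit values into runs of 1s (a value v becomes 32 - v low 1 bits),
// then transposes so that output[n] holds bit n of every channel.
void transform_pwm_32channel_5bit(const uint8_t input[32], uint32_t output[32]) {
    for(std::size_t i = 0; i < 32; i++) {
        output[i] = 0xffffffff >> input[i];
    }
    transform_32_32(output);
}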

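The Cortex-M series has a nice feature called bit-banding: every bit of a small memory region is also mapped to its own 32-bit word in a separate alias region. This allows for a quite efficient bitwise matrix transpose, which is coincidentally exactly what you need to bitbang efficiently.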
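For illustration, this is how an alias address is derived from a bit's location. The constants are the SRAM bit-band and alias base addresses of the generic Cortex-M3 memory map; the function name is hypothetical, and whether a given chip implements the feature is something to check in its datasheet:

```cpp
#include <cstdint>

// SRAM bit-band region and its alias region in the generic Cortex-M3 memory map.
constexpr std::uint32_t kSramBitBandBase = 0x20000000u;
constexpr std::uint32_t kSramAliasBase   = 0x22000000u;

// Hypothetical helper: address of the alias word that maps to bit `bit` (0..7)
// of the byte at `byte_addr`. Each byte occupies 32 alias bytes, each bit 4.
constexpr std::uint32_t bitband_alias_address(std::uint32_t byte_addr, unsigned bit) {
    return kSramAliasBase + (byte_addr - kSramBitBandBase) * 32u + bit * 4u;
}
```

Since a 32-bit word spans four bytes, bit j of word i lands at alias word index 32 * i + j, which is the indexing used in the transpose loop.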

The transform should take about 3 cycles per bit (compiled with GCC 6.3 and -funroll-loops), so this should amount to only about 12k cycles in total, or around 150 µs.

The only catch? This assumes that your specific Cortex-M3 actually supports the bit-band feature. I had no chance to test this on an Arduino.
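Since the hardware version is untested, a plain C++ transpose can serve as a reference to compare against on a PC, or as a slower fallback. This is only a sketch with a hypothetical function name; it performs the same bit swaps without using the alias region:

```cpp
#include <cstddef>
#include <cstdint>

// Reference 32x32 bit-matrix transpose in portable C++:
// bit j of word i is exchanged with bit i of word j.
inline void transpose_32x32_reference(std::uint32_t m[32]) {
    for (std::size_t i = 0; i < 32; ++i) {
        for (std::size_t j = i + 1; j < 32; ++j) {
            const std::uint32_t a = (m[i] >> j) & 1u;
            const std::uint32_t b = (m[j] >> i) & 1u;
            if (a != b) {
                // The two bits differ, so swapping them means flipping both.
                m[i] ^= std::uint32_t(1) << j;
                m[j] ^= std::uint32_t(1) << i;
            }
        }
    }
}
```

Running both versions on the same input and comparing the 32 output words is a quick way to confirm that bit-banding behaves as expected on the target.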

answered Sep 18 '22 03:09 by Ext3h
