I need to copy the contents of a byte array representing an image in RGB byte order into another RGBA (4 bytes per pixel) buffer. The alpha channel will get filled in later. What would be the fastest way of achieving this?
How tricky do you want it? You could set it up to copy a 4-byte word at a time, which might be a bit faster on some 32-bit systems:
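#include <stdint.h>

/* Copies `count` RGB pixels into an RGBA buffer, one 32-bit word per pixel.
   Each alpha byte receives the next pixel's red byte, so alpha must be
   overwritten afterwards. The pointer casts assume the CPU tolerates
   unaligned 32-bit accesses (true on x86, not guaranteed by the C standard). */
void fast_unpack(char* rgba, const char* rgb, const int count) {
    if(count<=0)
        return;
    /* First count-1 pixels: read 4 source bytes, write 4 destination bytes. */
    for(int i=count; --i; rgba+=4, rgb+=3) {
        *(uint32_t*)(void*)rgba = *(const uint32_t*)(const void*)rgb;
    }
    /* Last pixel: only 3 source bytes remain, so copy them one at a time. */
    for(int j=0; j<3; ++j) {
        rgba[j] = rgb[j];
    }
}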
The extra loop at the end handles the last pixel: the rgb array has only 3 bytes left at that point, so a 4-byte read there would run one byte past the end of the buffer. You could also make it a bit faster using aligned moves and SSE instructions, working in multiples of 4 pixels at a time. If you're feeling really ambitious, you can try even more horribly obfuscated things like prefetching a cache line into the FP registers, for example, then blitting it across to the other image all at once. Of course, the mileage you get out of these optimizations is going to be highly dependent on the specific system configuration you are targeting, and I would be really skeptical that there is much benefit at all to doing any of this instead of the simple thing.
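If the pointer casts make you nervous (they assume unaligned loads are fine and sidestep the aliasing rules), here is a minimal sketch of the same word-at-a-time idea written with memcpy instead. The name unpack_memcpy and the unsigned char / size_t signature are my own choices for illustration, not part of the code above. Compilers typically turn a fixed 4-byte memcpy into a single load or store, though I have not measured this variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: `unpack_memcpy` is a hypothetical name.
   Copies `count` RGB pixels into an RGBA buffer using 4-byte moves.
   As in the cast-based version, each alpha byte (except the last one)
   receives the next pixel's red byte and must be overwritten later. */
static void unpack_memcpy(unsigned char *rgba, const unsigned char *rgb,
                          size_t count)
{
    assert(rgba != NULL && rgb != NULL);
    if (count == 0)
        return;

    /* All pixels except the last: move 4 source bytes per pixel. */
    for (size_t i = 0; i + 1 < count; ++i) {
        uint32_t word;
        memcpy(&word, rgb + 3 * i, sizeof word);
        memcpy(rgba + 4 * i, &word, sizeof word);
    }

    /* Last pixel: only 3 source bytes remain, so copy exactly 3. */
    memcpy(rgba + 4 * (count - 1), rgb + 3 * (count - 1), 3);
}
```

Calling unpack_memcpy(dst, src, count) with a dst of at least 4*count bytes and a src of at least 3*count bytes never reads past the end of src, and the alpha byte of the final pixel is left untouched.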
And my simple experiments confirm that this is indeed a little bit faster, at least on my x86 machine. Here is a benchmark:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Word-at-a-time version: one 32-bit move per pixel, bytewise for the last pixel.
   Assumes unaligned 32-bit accesses are tolerated (true on x86). */
void fast_unpack(char* rgba, const char* rgb, const int count) {
    if(count<=0)
        return;
    for(int i=count; --i; rgba+=4, rgb+=3) {
        *(uint32_t*)(void*)rgba = *(const uint32_t*)(const void*)rgb;
    }
    for(int j=0; j<3; ++j) {
        rgba[j] = rgb[j];
    }
}

/* Baseline version: copy the 3 colour bytes of each pixel one at a time. */
void simple_unpack(char* rgba, const char* rgb, const int count) {
    for(int i=0; i<count; ++i) {
        for(int j=0; j<3; ++j) {
            rgba[j] = rgb[j];
        }
        rgba += 4;
        rgb += 3;
    }
}

int main() {
    const int count = 512*512;  /* number of pixels (the label below says bytes) */
    const int N = 10000;        /* repetitions per test */

    char* src = (char*)malloc(count * 3);  /* RGB source, contents irrelevant for timing */
    char* dst = (char*)malloc(count * 4);  /* RGBA destination */

    clock_t c0, c1;
    double t;

    printf("Image size = %d bytes\n", count);
    printf("Number of iterations = %d\n", N);

    /* Time the baseline. */
    printf("Testing simple unpack....");
    c0 = clock();
    for(int i=0; i<N; ++i) {
        simple_unpack(dst, src, count);
    }
    c1 = clock();
    printf("Done\n");
    t = (double)(c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Elapsed time: %lf\nAverage time: %lf\n", t, t/N);

    /* Time the word-at-a-time version. */
    printf("Testing tricky unpack....");
    c0 = clock();
    for(int i=0; i<N; ++i) {
        fast_unpack(dst, src, count);
    }
    c1 = clock();
    printf("Done\n");
    t = (double)(c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Elapsed time: %lf\nAverage time: %lf\n", t, t/N);

    free(src);
    free(dst);
    return 0;
}
And here are the results (compiled with g++ -O3):
Image size = 262144 bytes
Number of iterations = 10000
Testing simple unpack....Done
Elapsed time: 3.830000
Average time: 0.000383
Testing tricky unpack....Done
Elapsed time: 2.390000
Average time: 0.000239
So the tricky version takes about 40% less time (3.83 s vs. 2.39 s, roughly a 1.6x speedup) on a good day.
The fastest way would be to use a library that implements the conversion for you rather than writing it yourself. Which platform(s) are you targeting?
If you insist on writing it yourself for some reason, write a simple and correct version first. Use that. If the performance is inadequate, then you can think about optimizing it. In general, this sort of conversion is best done using vector permutes, but the exact optimal sequence varies depending on the target architecture.
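To make the "simple version first, vector permute later" advice concrete, here is a rough sketch for x86 with SSSE3, using the byte shuffle _mm_shuffle_epi8 as the permute. The function names unpack_scalar and unpack_permute are made up for illustration, and this is only one possible sequence, not necessarily the best one for any given CPU. When SSSE3 is not enabled at compile time (for example without -mssse3 on GCC or Clang), the sketch simply falls back to the scalar loop. Note that the vector path writes 0 into alpha while the scalar path leaves alpha untouched; either is fine if alpha gets filled in later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8 */
#endif

/* Simple reference version: copy 3 bytes per pixel, leave alpha alone. */
static void unpack_scalar(uint8_t *rgba, const uint8_t *rgb, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        rgba[4 * i + 0] = rgb[3 * i + 0];
        rgba[4 * i + 1] = rgb[3 * i + 1];
        rgba[4 * i + 2] = rgb[3 * i + 2];
    }
}

/* Sketch only: 4 pixels per iteration via a byte permute when SSSE3 is
   available, with the scalar loop handling whatever is left over. */
static void unpack_permute(uint8_t *rgba, const uint8_t *rgb, size_t count)
{
    size_t i = 0;
    assert(rgba != NULL && rgb != NULL);
#if defined(__SSSE3__)
    /* Output byte k takes input byte mask[k]; a negative index yields 0,
       which is what ends up in each alpha slot. */
    const __m128i mask = _mm_setr_epi8(0, 1, 2, -1,  3, 4, 5, -1,
                                       6, 7, 8, -1,  9, 10, 11, -1);
    /* Each load reads 16 bytes but only 12 are used, so keep going only
       while 16 source bytes are still inside the buffer. */
    for (; 3 * i + 16 <= 3 * count; i += 4) {
        __m128i in  = _mm_loadu_si128((const __m128i *)(rgb + 3 * i));
        __m128i out = _mm_shuffle_epi8(in, mask);
        _mm_storeu_si128((__m128i *)(rgba + 4 * i), out);
    }
#endif
    /* Remaining pixels (all of them if SSSE3 is unavailable). */
    unpack_scalar(rgba + 4 * i, rgb + 3 * i, count - i);
}
```

Keeping unpack_scalar around is deliberate: it gives you something to compare the vector path against before trusting it, and a baseline to benchmark against before deciding the extra complexity is worth it.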