I'm looking for a more efficient way to convert RGBA stored as doubles in a premultiplied colorspace to 8-bit-per-channel, non-premultiplied RGBA. It's a significant cost in my image processing.
For one channel, say R, the code looks something like this:
double temp = alpha > 0 ? src_r / alpha : 0;
uint8_t out_r = (uint8_t)min( 255, max( 0, int(temp * 255 + 0.5) ) );
This involves three conditionals, which I think prevent the compiler/CPU from optimizing it as well as it could. I believe some chips, in particular x86_64, have specialized double-clamping instructions, so in theory the above might be doable without conditionals.
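For example, I imagine something along these lines using SSE2's minsd/maxsd for the clamp (an untested sketch of my own; the alpha check would still be a branch):
#include <emmintrin.h> // SSE2
#include <stdint.h>

// Branchless clamp of temp*255+0.5 into [0,255] using maxsd/minsd.
static inline uint8_t clamp_channel(double temp) {
    __m128d v = _mm_set_sd(temp * 255.0 + 0.5);
    v = _mm_max_sd(v, _mm_set_sd(0.0));
    v = _mm_min_sd(v, _mm_set_sd(255.0));
    return (uint8_t)_mm_cvttsd_si32(v);
}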
Is there some technique, or special functions, which can make this conversion faster?
I'm using GCC and would be happy with a solution in C or C++ or with inline ASM if need be.
Here is an outline with some code (untested). It converts four pixels at once. The main advantage of this method is that it only has to do the division once (not four times); division is slow. But it has to do a transpose (AoS to SoA) to do this. It uses mostly SSE, except for converting the doubles to floats, which needs AVX.
1.) Load 16 doubles
2.) Convert them to floats
3.) Transpose from rgba rgba rgba rgba to rrrr gggg bbbb aaaa
4.) Divide all 4 alphas in one instruction
5.) Round floats to ints
6.) Compress 32-bit to 8-bit with saturation for underflow and overflow
7.) Transpose back to rgba rgba rgba rgba
8.) Write 4 pixels as integers in rgba format
#include <immintrin.h>

//Convert 4 premultiplied RGBA pixels (16 doubles) to 4 non-premultiplied
//8-bit RGBA pixels. rgba must be 32-byte aligned, out 16-byte aligned.
void convert4(const double rgba[16], int out[4]) {
    //load 16 doubles and convert to floats
    __m128 tmp1 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[0]));
    __m128 tmp2 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[4]));
    __m128 tmp3 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[8]));
    __m128 tmp4 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[12]));
    //rgba rgba rgba rgba -> rrrr gggg bbbb aaaa
    _MM_TRANSPOSE4_PS(tmp1, tmp2, tmp3, tmp4);
    //fact = alpha > 0 ? 255.0f/alpha : 0
    __m128 fact = _mm_div_ps(_mm_set1_ps(255.0f), tmp4);
    fact = _mm_and_ps(fact, _mm_cmpgt_ps(tmp4, _mm_setzero_ps())); //zero out lanes where alpha <= 0
    tmp1 = _mm_mul_ps(fact, tmp1); //rrrr
    tmp2 = _mm_mul_ps(fact, tmp2); //gggg
    tmp3 = _mm_mul_ps(fact, tmp3); //bbbb
    tmp4 = _mm_mul_ps(_mm_set1_ps(255.0f), tmp4); //aaaa
    //round to nearest int
    __m128i tmp1i = _mm_cvtps_epi32(tmp1);
    __m128i tmp2i = _mm_cvtps_epi32(tmp2);
    __m128i tmp3i = _mm_cvtps_epi32(tmp3);
    __m128i tmp4i = _mm_cvtps_epi32(tmp4);
    //compress from 32-bit to 8-bit, saturating underflow to 0 and overflow to 255
    __m128i tmp5i = _mm_packs_epi32(tmp1i, tmp2i);   //rrrr gggg as 16-bit
    __m128i tmp6i = _mm_packs_epi32(tmp3i, tmp4i);   //bbbb aaaa as 16-bit
    __m128i tmp7i = _mm_packus_epi16(tmp5i, tmp6i);  //rrrrggggbbbbaaaa as unsigned 8-bit
    //transpose back to rgba rgba rgba rgba
    __m128i out16 = _mm_shuffle_epi8(tmp7i, _mm_setr_epi8(
        0x00, 0x04, 0x08, 0x0c, 0x01, 0x05, 0x09, 0x0d,
        0x02, 0x06, 0x0a, 0x0e, 0x03, 0x07, 0x0b, 0x0f));
    //write 4 pixels as integers in rgba format
    _mm_store_si128((__m128i*)out, out16);
}
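To show the expected layout and alignment, here is a small hypothetical driver (my own, with arbitrary values; convert4 is just the name given to the routine above):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    //4 premultiplied pixels: r,g,b,a per pixel, channels in [0,1]
    __attribute__((aligned(32))) double rgba[16] = {
        0.25, 0.25, 0.25, 0.5,   //mid-gray at 50% alpha
        0.0,  0.0,  0.0,  0.0,   //fully transparent
        1.0,  0.5,  0.25, 1.0,   //opaque
        0.1,  0.2,  0.3,  0.4
    };
    __attribute__((aligned(16))) int out[4];
    convert4(rgba, out);
    const uint8_t *p = (const uint8_t*)out;
    for (int i = 0; i < 4; i++)
        printf("%d %d %d %d\n", p[4*i], p[4*i+1], p[4*i+2], p[4*i+3]);
    return 0;
}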
OK, this is pseudo-code, but with SSE how about something like this:
const c = (1/255, 1/255, 1/255, 1/255)
floats = (r, g, b, a)
alpha = (a, a, a, a)
alpha *= c
floats /= alpha
ints = cvt_float_to_int(floats)
ints = min(ints, (255, 255, 255, 255))
Here's an implementation
#include <immintrin.h>

typedef unsigned char byte;

void convert(const double* floats, byte* bytes, const int width, const int height, const int step) {
    for(int y = 0; y < height; ++y) {
        const double* float_row = floats + y * width * 4; // 4 doubles per pixel
        byte* byte_row = bytes + y * step;
        for(int x = 0; x < width; ++x) {
            __m128d src1 = _mm_load_pd(float_row);     // r, g
            __m128d src2 = _mm_load_pd(float_row + 2); // b, a
            __m128d mul = _mm_set1_pd(255.0 / float_row[3]); // note: this scale is also applied to the alpha channel
            __m128d norm1 = _mm_min_pd(_mm_set1_pd(255.0), _mm_mul_pd(src1, mul));
            __m128d norm2 = _mm_min_pd(_mm_set1_pd(255.0), _mm_mul_pd(src2, mul));
            // move r,g into bytes 0-1 and b,a into bytes 2-3, then OR the two halves together
            __m128i dst1 = _mm_shuffle_epi8(_mm_cvtpd_epi32(norm1), _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,4,0));
            __m128i dst2 = _mm_shuffle_epi8(_mm_cvtpd_epi32(norm2), _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,4,0,0x80,0x80));
            _mm_store_ss((float*)byte_row, _mm_castsi128_ps(_mm_or_si128(dst1, dst2)));
            float_row += 4;
            byte_row += 4;
        }
    }
}
Edit: in my original answer I worked with floats instead of doubles; the float version is kept below in case anyone's interested. Thanks to @Z boson for catching that. @OP: I don't handle the alpha == 0 case, so you'll get NaN with my solution; if you want that handled, go with @Z boson's solution.
Here's the float version:
void convert(const float* floats, byte* bytes, const int width, const int height, const int step) {
    for(int y = 0; y < height; ++y) {
        const float* float_row = floats + y * width * 4; // 4 floats per pixel
        byte* byte_row = bytes + y * step;
        for(int x = 0; x < width; ++x) {
            __m128 src = _mm_load_ps(float_row);             // r, g, b, a
            __m128 mul = _mm_set1_ps(255.0f / float_row[3]); // note: this scale is also applied to the alpha channel
            __m128i cvt = _mm_cvtps_epi32(_mm_mul_ps(src, mul));
            __m128i res = _mm_min_epi32(cvt, _mm_set1_epi32(255)); // clamp to 255 (SSE4.1)
            // gather the low byte of each 32-bit lane into bytes 0-3 (SSSE3)
            __m128i dst = _mm_shuffle_epi8(res, _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,12,8,4,0));
            _mm_store_ss((float*)byte_row, _mm_castsi128_ps(dst));
            float_row += 4;
            byte_row += 4;
        }
    }
}
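If you do want the alpha == 0 handling in my version, one possible tweak (my own sketch, untested) is to build mul with a compare-and-mask so the NaN never appears, replacing the mul line inside the loop:
__m128 a = _mm_set1_ps(float_row[3]);
__m128 mul = _mm_div_ps(_mm_set1_ps(255.0f), a);
// keep mul only where alpha > 0, otherwise force it to 0
mul = _mm_and_ps(mul, _mm_cmpgt_ps(a, _mm_setzero_ps()));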
Because of SSE alignment constraints, make sure your input pointers are 16-byte aligned, and use step to make sure each row starts at an aligned address. Many libs take such a step argument, but if you don't need it, you can simplify by using a single loop.
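For illustration (my own sketch, not from the code above), aligned buffers and a padded step could be set up with _mm_malloc like this:
#include <xmmintrin.h> // _mm_malloc / _mm_free

void example(void) {
    const int width = 640, height = 480;
    const int step = (width * 4 + 15) & ~15; // bytes per output row, rounded up to a multiple of 16
    double* floats = (double*)_mm_malloc((size_t)width * 4 * height * sizeof(double), 16);
    byte* bytes = (byte*)_mm_malloc((size_t)step * height, 16);
    // ... fill floats with premultiplied pixel data here ...
    convert(floats, bytes, width, height, step);
    _mm_free(bytes);
    _mm_free(floats);
}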
I quickly tested with this and get good values:
int main() {
__declspec(align(16)) double src[] = { 10,100,1000,255, 10,100,20,50 };
__declspec(align(16)) byte dst[8];
convert(src, dst, 2, 1, 16); // first pixel: dst == { 10,100,255,255 }
return 0;
}
I only have Visual Studio right now, so I can't test with GCC's optimizer, but I'm getting a 1.8x speedup for doubles and 4.5x for floats. It could be less with gcc -O3, but my code could also be optimized further.
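As a side note (my addition), with GCC these intrinsics need the matching instruction-set flags enabled, for example:
gcc -O3 -mssse3 -msse4.1 convert.c    # or -mavx / -march=native for the AVX-based answer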