I am creating a struct to store a single RGB pixel in an image.
struct Pixel
{
// color values range from 0.0 to 1.0
float r, g, b;
} __attribute__((aligned(16)));
I want to use 128-bit SSE instructions to do things like adding, multiplying, etc. This way I can perform operations on all 3 color channels at once. So the first packed float in my SSE register would be red, then green, then blue, but I am not so sure what would go into the fourth element. I really don't care what bits are in the extra 32 bits of padding. When I load a pixel into the SSE register I would imagine it contains either zeros or junk values. Is this problematic? Should I add a fourth alpha channel even though I don't really need one? The only way I see this being an issue is if I were dividing by a pixel and there was a zero value in the fourth spot, or if I were taking a root of a negative, etc.
Integer ops will have no problem at all with uninitialized values, since the latency is never data-dependent. Floating point is different. Some FPUs slow down on denormals, NaNs, and infinities (in any one of the vector elements).
Intel Nehalem and earlier slow down a lot when doing math ops with denormal inputs/outputs, and on FP underflow/overflow. Sandybridge has a nice FPU with fast add/sub for any inputs (according to Agner Fog's instruction tables), but multiply can still slow down.
Add/sub/multiply are fine with zeros, but uninitialized junk is potentially a problem, since its bit pattern might happen to be a NaN, an infinity, or a denormal.
Be careful with division that you aren't dividing by zero in the unused element. That could even raise an FPU exception, depending on HW settings (i.e. whether FP exceptions are unmasked).
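As a rough sketch of how to keep division safe: clear the padding element of the numerator and force the padding element of the divisor to 1.0 before dividing. The function name `pixel_div` and the mask constants are made up for illustration, and the code assumes SSE2 is available (it always is on x86-64).

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE intrinsics from xmmintrin.h */

/* Hypothetical helper: divide one {r, g, b, pad} vector by another.
   Lanes are numbered 0..3 from low to high, so the padding is lane 3. */
static __m128 pixel_div(__m128 num, __m128 den)
{
    /* All-ones in lanes 0..2, zero in lane 3 (arguments go from lane 3 down to lane 0). */
    const __m128 rgb_mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    /* 1.0f in lane 3 only. */
    const __m128 pad_one = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);

    num = _mm_and_ps(num, rgb_mask);                      /* padding lane becomes 0.0 */
    den = _mm_or_ps(_mm_and_ps(den, rgb_mask), pad_one);  /* padding lane becomes 1.0 */
    return _mm_div_ps(num, den);                          /* lane 3 computes 0.0 / 1.0 */
}
```

The extra and/or cost a couple of cheap bitwise ops per division. If the padding is already known to be zero in the numerator, only the divisor needs fixing up.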
So yes, keeping the unused element zeroed is probably a good idea. Depending how you generate things in the first place, this may be pretty cheap to accomplish. (e.g. movd/pinsrd/pinsrd (or insertps) to put three 32bit elements into a vector, with the initial movd zeroing the high 96b.)
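With intrinsics, two ways of getting a zeroed fourth element might look something like the following. This is only a sketch: the helper names are invented here, the struct is the one from the question, and what instructions you actually get (movd/insertps, a constant load, etc.) is up to the compiler. It assumes GCC or clang for the `__attribute__` syntax and SSE2 for the mask.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE intrinsics from xmmintrin.h */

struct Pixel
{
    float r, g, b;
} __attribute__((aligned(16)));  /* sizeof(struct Pixel) is 16: 12 bytes of data + 4 of padding */

/* Hypothetical helper: build {r, g, b, 0} from three scalars. */
static __m128 pixel_from_rgb(float r, float g, float b)
{
    return _mm_set_ps(0.0f, b, g, r);  /* arguments go from lane 3 down to lane 0 */
}

/* Hypothetical helper: load all 16 bytes of a Pixel, then clear whatever
   was sitting in the padding lane. */
static __m128 pixel_load_zero_pad(const struct Pixel *p)
{
    const __m128 rgb_mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    return _mm_and_ps(_mm_load_ps(&p->r), rgb_mask);  /* aligned 16-byte load + and */
}
```

If every pixel is created through something like `pixel_from_rgb` and stored back with a full 16-byte store, the padding stays zero in memory and the masking on load becomes unnecessary.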
One workaround could be to store a 2nd copy of the blue channel in the 4th element (or whatever is most convenient to shuffle there). You could load vectors with movsldup (SSE3) / movlps. After movsldup, your register would hold { b b r r } (highest element first). movlps would then re-load the lower 64 bits, so you'd have { b b g r }. (movlpd does the same thing, BTW.) Or if the shuffle port is less busy than the load ports, do one 16B load and then shufps. (movsldup on Intel CPUs is a single uop that runs on a load port, even though it has the duplication built in.)
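The one-load-plus-shufps variant can be sketched with plain SSE intrinsics, as below. The helper name `pixel_dup_blue` is made up, and the sketch assumes the vector was already loaded as { x b g r }, where x is whatever the padding held. The movsldup route would need SSE3 (`_mm_moveldup_ps`), so it is left out here.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: turn { x b g r } into { b b g r } (highest element first),
   i.e. overwrite the padding lane with a second copy of blue. */
static __m128 pixel_dup_blue(__m128 v)
{
    /* _MM_SHUFFLE lists source lanes for result lanes 3, 2, 1, 0.
       Lanes 0 and 1 keep r and g; lanes 2 and 3 both take source lane 2 (blue). */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 1, 0));
}
```

After this, the fourth element always holds a valid color value, so any operation that is safe for the blue channel is safe for the padding too.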
Another option would be to pack your pixels into 12 bytes, so a 16B load would get one component of the next pixel. Depending on what you're doing, overlapping stores that clobber one element of the next pixel might or might not be ok. Loading the next pixel before storing the current could work around that for some ops. It's quite easy to be cache or bandwidth-limited, so saving 1/4 space at the small cost of the occasional cache-line split load/store could be worth it.
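A minimal sketch of the packed 12-byte layout, assuming the pixels live in a flat float array (r, g, b, r, g, b, ...) with at least one extra float after the last pixel so the final 16B load stays in bounds. The function names are invented for illustration. Instead of an overlapping 16B store, this version stores only three elements with two narrower stores, which avoids clobbering the next pixel at the cost of an extra instruction or two.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: load pixel i from a packed r,g,b,r,g,b,... array.
   Lane 3 ends up holding the red channel of pixel i+1 (or the trailing
   padding float for the last pixel). */
static __m128 packed_load(const float *rgb, size_t i)
{
    return _mm_loadu_ps(rgb + 3 * i);  /* unaligned 16-byte load */
}

/* Hypothetical helper: store lanes 0..2 only, leaving the next pixel untouched. */
static void packed_store(float *rgb, size_t i, __m128 v)
{
    float *dst = rgb + 3 * i;
    _mm_storel_pi((__m64 *)dst, v);              /* low 64 bits: r and g */
    _mm_store_ss(dst + 2, _mm_movehl_ps(v, v));  /* movhlps brings b down to lane 0 */
}
```

Since lane 3 holds real color data from the neighbouring pixel here, it is a normal value rather than junk, which sidesteps most of the NaN/denormal worries for add/sub/multiply.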