In x86/SIMD assembly, I've populated an XMM register with four 32-bit pixels of a graphic image I need to convert. However, the pixels are in a 10-bit packed RGB format, so each one occupies 32 bits in this form:
[ red ][ green ][ blue ][]
RRRRRRRRRRGGGGGGGGGGBBBBBBBBBB00
The last two bits are padding bits and are unused.
I need to multiply these pixels by another value, but the value needs to be masked so it only affects, say, the red channel. This value is constant, so it can be hard-coded. Let's say the value is 0.1234. How would I put this into another XMM register with appropriate masking so it only affects the red portion of each 32-bit segment?
Illustrated graphically, I would like to do something like this:
XMM0 (first 32 bit segment):
[ 0.1234 ][ 1.0 ][ 1.0 ][]
*
XMM1 (first 32 bit segment):
RRRRRRRRRRGGGGGGGGGGBBBBBBBBBB00
The result would be the product of XMM0 and XMM1. Of course, this 32-bit segment would be repeated across the entire XMM register; I only show the first 32 bits here so you get the idea.
If you really only want to affect the red portion, you might be able to come up with a trick that multiplies the red and part of the green by some constant (treating the register as a collection of 16-bit shorts) and then recombines just the new red part with the old green and blue.
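Here is a minimal sketch of that 16-bit trick, written with SSE2 intrinsics in C (they correspond to PAND, PMULHUW, PANDN and POR in assembly). It assumes each 32-bit lane, read as an integer, holds red in bits 31..22, matching the layout in the question; if your pixels sit in memory in a different byte order you would need a byte swap first. The names `scale_red_q16` and `RED_SCALE_Q16` are made up for illustration, and I mask the green bits out before the multiply so they cannot disturb the rounding of red.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical constant: 0.1234 as a 0.16 fixed-point fraction,
 * round(0.1234 * 65536) = 8087. It must stay below 65536, so this
 * approach only works for scale factors smaller than 1.0. */
#define RED_SCALE_Q16 8087

/* Scale only the 10-bit red field of four packed pixels.
 * Assumes red occupies bits 31..22 of every 32-bit lane. */
static __m128i scale_red_q16(__m128i px)
{
    /* 0xFFC00000 selects the top 10 bits of each 32-bit lane. */
    const __m128i red_mask = _mm_set1_epi32((int)0xFFC00000u);
    /* Multiplier in the high 16-bit half of each lane, zero in the low half. */
    const __m128i mult = _mm_set1_epi32(RED_SCALE_Q16 << 16);

    __m128i red = _mm_and_si128(px, red_mask);    /* high half = R << 6 */
    __m128i scaled = _mm_mulhi_epu16(red, mult);  /* ((R << 6) * c) >> 16 */
    scaled = _mm_and_si128(scaled, red_mask);     /* drop the fraction bits */

    /* Old green/blue/padding combined with the new red. */
    return _mm_or_si128(_mm_andnot_si128(red_mask, px), scaled);
}
```

With this constant the new red value works out to `(R * 8087) >> 16`, i.e. the product truncated rather than rounded, which may or may not be precise enough for your conversion.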
A better strategy, if you're going to operate on all of the colors, is to unpack that format into a lane type the XMM instructions support directly (such as 16-bit integers, 32-bit integers, or 32-bit floats) using a combination of shift, mask, and shuffle operations (and possibly a conversion to float). Then do all of your math, and pack the result back.
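As a rough sketch of the unpack and re-pack steps, the following C code uses SSE2 intrinsics to split four packed pixels into three registers of 32-bit integer lanes and to put them back together. It uses only shifts and masks (no shuffles), and it assumes the bit layout from the question, with red in bits 31..22, green in bits 21..12, blue in bits 11..2. The type `rgb10_lanes` and both function names are hypothetical helpers, not part of any library.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper type: one channel per register, four pixels each. */
typedef struct { __m128i r, g, b; } rgb10_lanes;

/* Split four packed 10:10:10:2 pixels into 32-bit integer lanes. */
static rgb10_lanes unpack_rgb10(__m128i px)
{
    const __m128i mask10 = _mm_set1_epi32(0x3FF);
    rgb10_lanes out;
    out.r = _mm_srli_epi32(px, 22);  /* top 10 bits, no mask needed */
    out.g = _mm_and_si128(_mm_srli_epi32(px, 12), mask10);
    out.b = _mm_and_si128(_mm_srli_epi32(px, 2), mask10);
    return out;
}

/* Re-pack the lanes; every value is assumed to be in 0..1023 already.
 * The two padding bits come out as zero. */
static __m128i pack_rgb10(rgb10_lanes c)
{
    __m128i r = _mm_slli_epi32(c.r, 22);
    __m128i g = _mm_slli_epi32(c.g, 12);
    __m128i b = _mm_slli_epi32(c.b, 2);
    return _mm_or_si128(r, _mm_or_si128(g, b));
}
```

All of the per-channel math would go between the two calls, on `r`, `g` and `b` independently.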
If you are ever re-using any values (for example, if you are computing a filter kernel) and you're working in float, it will usually be much faster to unpack and convert to float once and then re-use that value over and over, even if that means writing a loop that unpacks a whole row to 32-bit float before operating on it and then re-packs the whole row.
Assuming you want to use floating point to multiply your values, I would unpack the R/G/B values into individual floating-point lanes of an XMM register, dividing each value by 1023.0 so it lands in the range 0.0 to 1.0.
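Something along these lines, as a sketch in C with SSE2 intrinsics rather than a finished routine: shift and mask each 10-bit field, convert it with CVTDQ2PS (`_mm_cvtepi32_ps`), and divide by 1023.0. The function name `unpack_rgb10_to_unit_float` is invented for this example, and the bit positions again assume red in bits 31..22 of each 32-bit lane.

```c
#include <emmintrin.h>  /* SSE2 */

/* Unpack four packed 10:10:10:2 pixels into three float registers,
 * each lane normalized to 0.0 .. 1.0 by dividing by 1023.0. */
static void unpack_rgb10_to_unit_float(__m128i px,
                                       __m128 *r, __m128 *g, __m128 *b)
{
    const __m128i mask10 = _mm_set1_epi32(0x3FF);
    const __m128 max10 = _mm_set1_ps(1023.0f);

    __m128i ri = _mm_srli_epi32(px, 22);
    __m128i gi = _mm_and_si128(_mm_srli_epi32(px, 12), mask10);
    __m128i bi = _mm_and_si128(_mm_srli_epi32(px, 2), mask10);

    /* The values are at most 1023, so the signed int-to-float
     * conversion is exact. */
    *r = _mm_div_ps(_mm_cvtepi32_ps(ri), max10);
    *g = _mm_div_ps(_mm_cvtepi32_ps(gi), max10);
    *b = _mm_div_ps(_mm_cvtepi32_ps(bi), max10);
}
```

If throughput matters more than exactness, multiplying by a precomputed reciprocal is the usual substitute for the division, at the cost of a full-scale value that may no longer come out as exactly 1.0.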
You may also find that it's actually easier to prepare four R, four G, and four B values in separate registers, build another XMM register that holds the same multiplier in every lane, and multiply by that, rather than holding R, G and B in one register. Obviously, this would require a bit of unrolling of the loop, but that tends to improve performance quite a bit anyway.
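To make the one-channel-per-register idea concrete, here is a hedged sketch in C with SSE2 intrinsics. It keeps the values in the 0..1023 range instead of normalizing them, since a pure scale does not need the division, broadcasts one gain per channel with `_mm_set1_ps`, clamps, truncates back to integers, and re-packs. The function `scale_rgb10` and its parameters are hypothetical; the layout assumption is the same as in the question (red in bits 31..22 of each lane).

```c
#include <emmintrin.h>  /* SSE2 */

/* Scale each channel of four packed 10:10:10:2 pixels by its own gain.
 * Results are clamped to 0..1023 and truncated toward zero. */
static __m128i scale_rgb10(__m128i px, float gain_r, float gain_g, float gain_b)
{
    const __m128i mask10 = _mm_set1_epi32(0x3FF);
    const __m128 lo = _mm_setzero_ps();
    const __m128 hi = _mm_set1_ps(1023.0f);

    /* Four R, four G and four B values, each in its own register. */
    __m128 r = _mm_cvtepi32_ps(_mm_srli_epi32(px, 22));
    __m128 g = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 12), mask10));
    __m128 b = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 2), mask10));

    /* Same multiplier in every lane of a register. */
    r = _mm_mul_ps(r, _mm_set1_ps(gain_r));
    g = _mm_mul_ps(g, _mm_set1_ps(gain_g));
    b = _mm_mul_ps(b, _mm_set1_ps(gain_b));

    /* Clamp so a gain above 1.0 cannot spill into a neighbouring field. */
    r = _mm_min_ps(_mm_max_ps(r, lo), hi);
    g = _mm_min_ps(_mm_max_ps(g, lo), hi);
    b = _mm_min_ps(_mm_max_ps(b, lo), hi);

    /* Truncate to integers and re-pack; padding bits come out as zero. */
    __m128i ri = _mm_slli_epi32(_mm_cvttps_epi32(r), 22);
    __m128i gi = _mm_slli_epi32(_mm_cvttps_epi32(g), 12);
    __m128i bi = _mm_slli_epi32(_mm_cvttps_epi32(b), 2);
    return _mm_or_si128(ri, _mm_or_si128(gi, bi));
}
```

Calling `scale_rgb10(px, 0.1234f, 1.0f, 1.0f)` corresponds to the `[ 0.1234 ][ 1.0 ][ 1.0 ]` picture in the question. When only one channel changes, the untouched channels could of course skip the float round trip.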