Suppose I have an inline function:
inline int mul(short x, short y) {
    return (int)x * (int)y;
}
Here y is in {1,2,...,32}, and x is in {-4,-3,-2,-1,0,1,...,8192}. Since y lies in a very small range, is there a way to speed up mul()?
Background: this code is extracted from a scientific computing program written in C/C++. Profiling shows that the above function accounts for over 10% of the program's total CPU time, because it is called very frequently. I would therefore like to find a way to speed it up.
Thank you :)
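Intel's SSE intrinsics provide the data type __m128i (available since SSE2), which can hold four 32-bit integers. SSE4.1 adds the following intrinsic:
__m128i _mm_mullo_epi32(__m128i a, __m128i b)
Packed 32-bit integer multiplication that keeps the low 32 bits of each product; the upper halves of the results are truncated.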
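You can perform 4 multiplications at a time. Since you know that your data range is limited, truncation won't be a problem: the largest possible product is 8192 * 32 = 262144, which fits easily in 32 bits. You could also convert to single-precision floating point and use the older mulps instruction (the _mm_mul_ps intrinsic, available since the original SSE); products of this size are still represented exactly in a float.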
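As a rough sketch of the four-at-a-time idea, assuming the inputs can be gathered into arrays (the question only shows a scalar mul(), so the batch layout, the name mul_batch, and the MUL4_TARGET macro are my own inventions for illustration), something like the following might work. It needs an x86 CPU with SSE4.1; the target attribute is a GCC/Clang convenience, and with other compilers you would enable SSE4.1 through compiler options instead.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi16_epi32, _mm_mullo_epi32 */
#include <stddef.h>

/* Assumption: GCC or Clang; lets this one function use SSE4.1
   without compiling the whole file with -msse4.1. */
#if defined(__GNUC__) || defined(__clang__)
#define MUL4_TARGET __attribute__((target("sse4.1")))
#else
#define MUL4_TARGET
#endif

/* Hypothetical batch version of mul(): out[i] = x[i] * y[i] for i in [0, n). */
MUL4_TARGET
void mul_batch(const short *x, const short *y, int *out, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Load four 16-bit values (64 bits) into the low half of a register. */
        __m128i xs = _mm_loadl_epi64((const __m128i *)(x + i));
        __m128i ys = _mm_loadl_epi64((const __m128i *)(y + i));

        /* Sign-extend the four 16-bit lanes to four 32-bit lanes. */
        __m128i x32 = _mm_cvtepi16_epi32(xs);
        __m128i y32 = _mm_cvtepi16_epi32(ys);

        /* Four 32-bit multiplications at once; the low 32 bits are kept. */
        __m128i prod = _mm_mullo_epi32(x32, y32);

        /* Unaligned store, so out[] needs no special alignment. */
        _mm_storeu_si128((__m128i *)(out + i), prod);
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        out[i] = (int)x[i] * (int)y[i];
}
```

Whether this actually helps depends on whether the call sites can be restructured to process several values per call. If mul() is invoked on isolated scalars, the cost of packing and unpacking will likely outweigh the gain, so measure before and after.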
Besides, it might be a good idea to analyze your program with a profiler like VTune and see if you are suffering from excessive cache misses, aliasing, or alignment problems.