I'm having a heck of a time trying to come up with a constant time rotate that does not violate the C/C++ standards.
The problem is the edge/corner cases, where operations are called out in algorithms and those algorithms cannot be changed. For example, the following is from Crypto++ and shows up when running the test harness under GCC's UBSan (i.e., g++ -fsanitize=undefined):
$ ./cryptest.exe v | grep runtime
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:625:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
And the code at misc.h:637:
template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>(sizeof(T)*8-y)));  // undefined when y == 0: the right-shift count becomes sizeof(T)*8 == 32
}
Intel's ICC was particularly ruthless: it removed the entire function call when the y %= sizeof(T)*8 was missing. We fixed that a few years back, but left the other errata in place for lack of a constant time solution.
There's one pain point remaining. When y = 0, I get a condition where 32 - y = 32, which sets up the undefined behavior. If I add a check like if (y == 0) ..., then the code fails to meet the constant time requirement.
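For reference, here is a minimal sketch of the branchy workaround (hypothetical code, not the actual Crypto++ implementation); the early return is what breaks the "same instructions every time" requirement:
template <class T> inline T rotlModBranchy(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    if (y == 0)
        return x;  // special-casing y == 0 avoids the undefined 32-bit shift below...
    return T((x<<y) | (x>>(sizeof(T)*8-y)));  // ...but the branch means the executed path depends on y
}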
I've looked at a number of other implementations, from the Linux kernel to other cryptographic libraries. They all contain the same undefined behavior, so it appears to be a dead end.
How can I perform the rotate in nearly constant time with a minimum number of instructions?
EDIT: by near constant time, I mean avoid the branch so the same instructions are always executed. I'm not worried about CPU microcode timings. While branch prediction may be great on x86/x64, it may not perform as well on other platforms, like embedded.
None of these tricks would be required if GCC or Clang provided an intrinsic to perform the rotate in near constant time. I'd even settle for "perform the rotate" since they don't even have that.
I've linked to this answer from several other "rotate" questions for the full details, including this community wiki question, which should be kept up to date with best practices.
I found a blog post about this issue, and it looks like it's finally a solved problem (with new-enough compiler versions).
John Regehr at the University of Utah recommends version "c" of his attempts at making a rotate function. I supplemented his assert with a bitwise AND, and found that it still compiles to a single rotate insn.
#include <stdint.h>
#include <limits.h>
#include <assert.h>

typedef uint32_t rotwidth_t;  // parameterize for comparing compiler output with various sizes

rotwidth_t rotl (rotwidth_t x, unsigned int n)
{
    const unsigned int mask = (CHAR_BIT*sizeof(x)-1);  // e.g. 31
    assert( (n<=mask) && "rotate by type width or more");
    n &= mask;  // avoid undef behaviour with NDEBUG. 0 overhead for most types / compilers
    return (x<<n) | (x>>( (-n)&mask ));
}

rotwidth_t rot_const(rotwidth_t x)
{
    return rotl(x, 7);
}
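Why the (-n)&mask trick works: for n in [1, 31], (-n)&31 equals 32-n, and for n = 0 it yields 0, so the right shift becomes x>>0 and the OR still produces x, with no shift count ever reaching 32. A tiny sanity check of that identity (my addition, not part of Regehr's code):
#include <assert.h>

int main(void)
{
    // (-n) & 31 matches (32 - n) % 32 for every rotate count in [0, 32)
    for (unsigned int n = 0; n < 32; n++)
        assert(((-n) & 31u) == (32u - n) % 32u);
    return 0;
}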
This could be templated on x's type, but for real use it probably makes more sense to have the width in the function name (like rotl32). Usually when you're rotating, you know what width you want, and that matters more than what size variable you're currently storing the value in.
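As an illustration, a fixed-width wrapper might look like the following (the name rotl32 and exact layout are my sketch, not taken from any particular library):
#include <stdint.h>
#include <limits.h>

// Width in the name: callers say which rotate width they want,
// regardless of what size variable currently holds the value.
static inline uint32_t rotl32(uint32_t x, unsigned int n)
{
    const unsigned int mask = CHAR_BIT*sizeof(x) - 1;  // 31
    n &= mask;
    return (x<<n) | (x>>( (-n) & mask ));
}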
Also make sure to only use this with unsigned types. Right-shift of a negative signed value does an arithmetic shift, shifting in sign bits. (Technically it's implementation-defined behaviour, but everything uses 2's complement now.)
Pabigot independently came up with the same idea before I did, and posted it on github. His version has C++ static_assert checking to make it a compile-time error to use a rotate count outside the valid range for the type.
I tested mine with gcc.godbolt.org, with NDEBUG defined, for variable and compile-time-const rotate counts: gcc, clang, and icc all produce optimal code with new-enough versions. (One caveat: with -march=native, the constant-count case came out as shld $7, %edi, %edi rather than a plain rotate; it's fine without -march=native.)

Even newer compiler versions can handle the commonly-given code from wikipedia (included in the godbolt sample) without generating a branch or cmov. John Regehr's version has the advantage of avoiding undefined behaviour when the rotate count is 0.
There are some caveats with 8 and 16 bit rotates, but compilers seem fine with 32 or 64, when n is uint32_t. See the comments in the code on the godbolt link for some notes from my testing of various widths of uint*_t. Hopefully this idiom will be better recognized by all compilers for more combinations of type widths in the future. Sometimes gcc will uselessly emit an AND insn on the rotate count, even though the x86 ISA defines the rotate insns with that exact AND as the first step.
"optimal" means as efficient as:
# gcc 4.9.2   rotl(unsigned int, unsigned int):
    movl    %edi, %eax
    movl    %esi, %ecx
    roll    %cl, %eax
    ret

# rot_const(unsigned int):
    movl    %edi, %eax
    roll    $7, %eax
    ret
When inlined, the compiler should be able to arrange for values to be in the right registers in the first place, resulting in just a single rotate.
With older compilers, you'll still get ideal code when the rotate count is a compile-time constant. Godbolt lets you test with ARM as a target, and it used a rotate there, too. With variable counts on older compilers, you get a bit of code bloat, but no branches or major performance problems, so this idiom should be safe to use in general.
BTW, I modified John Regehr's original to use CHAR_BIT*sizeof(x), and gcc / clang / icc emit optimal code for uint64_t as well. However, I did notice that changing x to uint64_t while the function return type is still uint32_t makes gcc compile it to shifts/or. So be careful to do the cast to 32 bits in a separate statement (sequence point) if you want the low 32b of a 64b rotate: i.e., assign the result to a 64-bit variable, then cast/return it. icc still generates a rotate insn, but gcc and clang don't, for:
// generates slow code: cast separately.
uint32_t r = (uint32_t)( (x<<n) | (x>>( -n&(CHAR_BIT*sizeof(x)-1) )) );
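Spelled out as a helper (the name rotl64_low32 is my own; this is just a sketch of the advice above): do the rotate at full 64-bit width, assign it to a 64-bit variable, then truncate in a separate statement.
#include <stdint.h>
#include <limits.h>

// Sketch: 64-bit rotate, keeping only the low 32 bits of the result.
static inline uint32_t rotl64_low32(uint64_t x, unsigned int n)
{
    const unsigned int mask = CHAR_BIT*sizeof(x) - 1;   // 63
    n &= mask;
    uint64_t rotated = (x<<n) | (x>>( (-n) & mask ));   // full-width rotate, same idiom as above
    return (uint32_t)rotated;                           // cast afterwards, not inside the rotate expression
}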
If anyone can test this with MSVC, it would be useful to know what happens there.
You can add one additional modulo operation to prevent shifting by 32 bits, but I'm not convinced this is faster than using an if check in conjunction with branch predictors.
template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>((sizeof(T)*8-y) % (sizeof(T)*8))));
}
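To see why the extra modulo removes the undefined behaviour: when y is 0, (sizeof(T)*8 - y) % (sizeof(T)*8) evaluates to 0 rather than 32, so the right shift is x>>0 and the OR still yields x. A quick check (my addition), assuming T is uint32_t and the rotlMod template above is in scope:
#include <cassert>
#include <cstdint>

int main()
{
    uint32_t x = 0xDEADBEEF;
    assert(rotlMod(x, 0) == x);                       // y == 0: (32-0) % 32 == 0, no undefined shift
    assert(rotlMod(x, 1) == ((x << 1) | (x >> 31)));  // ordinary rotate case
    return 0;
}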
Writing the expression as T((x<<y) | ((x>>(sizeof(T)*CHAR_BIT-y-1))>>1)) should yield defined behavior for all values of y below the bit size, assuming that T is an unsigned type with no padding. Unless the compiler has a good optimizer, the resulting code may not be as good as what would have been produced by your original expression. Having to put up with clunky, hard-to-read code which will yield slower execution on many compilers is part of the price of progress, however, since a hyper-modern compiler which is given
if (y) do_something();
return T((x<<y) | (x>>(sizeof(T)*8-y)));
might improve the "efficiency" of the code by making the call to do_something unconditional.
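A quick sanity check of the alternate expression at the troublesome count (a sketch of my own, assuming T is uint32_t; the helper name rotl_defined is mine): with y = 0 the right-hand side is (x >> 31) >> 1, two individually well-defined shifts whose combined result is 0, so the whole expression reduces to x.
#include <cassert>
#include <climits>
#include <cstdint>

// Sketch: the alternate expression wrapped in a function, with T fixed to uint32_t.
inline uint32_t rotl_defined(uint32_t x, unsigned int y)
{
    return uint32_t((x << y) | ((x >> (sizeof(uint32_t)*CHAR_BIT - y - 1)) >> 1));
}

int main()
{
    uint32_t x = 0x80000001u;
    assert(rotl_defined(x, 0) == x);                       // y == 0: no shift by 32 anywhere
    assert(rotl_defined(x, 1) == ((x << 1) | (x >> 31)));  // ordinary rotate case
    return 0;
}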
PS: I wonder if there are any real-world platforms where changing the definition of shift-right, so that x >> y with y precisely equal to the bit size of x would be required to yield either 0 or x (the choice being made in an arbitrary, unspecified fashion), would require the platform to generate extra code or would preclude genuinely useful optimizations in non-contrived scenarios?