
Why does AES in SSE not provide full function?

The Rijndael key schedule procedure involves RotWord, SubWord, and XOR, which are all supported by _mm_aeskeygenassist_si128:

X3[31:0] ← SRC [127: 96];
X2[31:0] ← SRC [95: 64];
X1[31:0] ← SRC [63: 32];
X0[31:0] ← SRC [31: 0];
RCON[31:0] ← ZeroExtend(Imm8[7:0]);
DEST[31:0] ← SubWord(X1);
DEST[63:32 ] ← RotWord( SubWord(X1) ) XOR RCON;
DEST[95:64] ← SubWord(X3);
DEST[127:96] ← RotWord( SubWord(X3) ) XOR RCON;
DEST[VLMAX-1:128] (Unmodified)

However, it does not return a complete round key. For example, instead of simply computing

DEST[31:0] ← SubWord(X1),

I would expect it to compute

DEST[31:0] ← RotWord(SubWord(X3)) XOR RCON XOR X0.

As a result, after _mm_aeskeygenassist_si128, we developers have to do some extra work before the round key is completely generated.
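To make the pseudocode concrete, here is a portable C sketch of what the instruction computes. It is a software model for illustration only, not how the hardware works: the `vec128` type and the names `gf_mul`, `sbox_byte`, `sub_word`, `rot_word`, and `aeskeygenassist_model` are all hypothetical, and the S-box is derived by brute force rather than taken from a table.

```c
#include <stdint.h>

/* 128-bit value as four dwords; w[0] holds bits 31:0. (Illustrative type.) */
typedef struct { uint32_t w[4]; } vec128;

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return r;
}

/* AES S-box: multiplicative inverse (0 maps to 0), then the affine transform.
   Brute force keeps the sketch short; real code would use a lookup table. */
static uint8_t sbox_byte(uint8_t x)
{
    uint8_t inv = 0;
    for (int c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    }
    uint8_t s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* SubWord: apply the S-box to each of the four bytes. */
static uint32_t sub_word(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (uint32_t)sbox_byte((uint8_t)(x >> (8 * i))) << (8 * i);
    return r;
}

/* RotWord on a little-endian dword: bytes [b0,b1,b2,b3] become [b1,b2,b3,b0]. */
static uint32_t rot_word(uint32_t x)
{
    return (x >> 8) | (x << 24);
}

/* Software model of the DEST computation in the pseudocode above. */
vec128 aeskeygenassist_model(vec128 src, uint8_t imm8)
{
    uint32_t rcon = imm8;                 /* ZeroExtend(Imm8[7:0]) */
    uint32_t s1 = sub_word(src.w[1]);     /* SubWord(X1) */
    uint32_t s3 = sub_word(src.w[3]);     /* SubWord(X3) */
    vec128 dest = {{ s1, rot_word(s1) ^ rcon, s3, rot_word(s3) ^ rcon }};
    return dest;
}
```

For a 128-bit key, only `dest.w[3]` is needed: XORing it with X0 gives the first word of the next round key, which is exactly the leftover step described above.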

Why doesn't SSE (AES-NI) provide a complete AES key generation procedure?

xtt asked Oct 01 '17 13:10


1 Answer

See Key Expansion Using AESKEYGENASSIST (page 23) in Intel's AES-NI whitepaper. They point out that the instruction can be used as a building block for all three key sizes: 128, 192, and 256 bits. They only show an example for 128-bit keys, doing the extra work with a function call after each aeskeygenassist instruction, as you describe.

AESKEYGENASSIST is already micro-coded (e.g. 13 uops on Skylake vs. only 1 for AESDEC / AESENC, per http://agner.org/optimize/). Given how it's currently implemented, separate instructions that also did the last few steps, which differ between key sizes, wouldn't run much faster.

Skylake has 1 per 12 cycle aeskeygenassist throughput, but Nehalem (strictly speaking its Westmere shrink, the first generation to ship AES-NI) had 1 per 2 cycle, same as aesenc. So in that generation, I guess they did implement it mostly in dedicated hardware. That's probably the other part of the explanation: doing more steps would either have taken more hardware in the first-gen implementation, or forced the instruction to be micro-coded (which it probably wasn't at the time) so the extra steps could run as additional uops.

Intel evidently doesn't consider key setup performance-critical: as I said, they reduced the performance of aeskeygenassist after Nehalem. (Sandybridge already had it microcoded, with 1 per 8 clock throughput.)


Having different instructions for different key sizes would have taken more opcodes. At that point, Intel hadn't introduced VEX prefixes yet, so spending more opcode space on AES instructions would have reduced the room left for future extensions. (VEX has tons of coding space available: its multi-bit opcode-map field uses only a few values so far, covering the existing escape-byte combinations used by current instructions.)

Peter Cordes answered Nov 15 '22 05:11