I'm working on a PowerPC machine with in-core crypto. I'm having trouble porting AES key expansion from big endian to little endian using built-ins. Big endian works, but little endian does not.
The algorithm below is the snippet presented in an IBM blog article. I think I have the issue isolated to line 2 below:
typedef __vector unsigned char uint8x16_p8;
uint8x16_p8 r0 = {0};
r3 = vec_perm(r1, r1, r5); /* line 1 */
r6 = vec_sld(r0, r1, 12); /* line 2 */
r3 = vcipherlast(r3, r4); /* line 3 */
r1 = vec_xor(r1, r6); /* line 4 */
r6 = vec_sld(r0, r6, 12); /* line 5 */
r1 = vec_xor(r1, r6); /* line 6 */
r6 = vec_sld(r0, r6, 12); /* line 7 */
r1 = vec_xor(r1, r6); /* line 8 */
r4 = vec_add(r4, r4); /* line 9 */
r1 = vec_xor(r1, r3); /* line 10 */
// r1 is ready for next round
Upon entering the function, both big endian and little endian have the following parameters:
(gdb) p r1
$1 = {0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88,
0x9, 0xcf, 0x4f, 0x3c}
(gdb) p r5
$2 = {0xd, 0xe, 0xf, 0xc, 0xd, 0xe, 0xf, 0xc, 0xd, 0xe, 0xf, 0xc, 0xd, 0xe,
0xf, 0xc}
However, after executing line 2, r6 has the value:
Little endian machine:
(gdb) p r6
$3 = {0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x9, 0xcf, 0x4f, 0x3c,
0x0, 0x0, 0x0, 0x0}
(gdb) p $vs0
$3 = {uint128 = 0x8815f7aba6d2ae28000000003c4fcf09, v2_double = {
4.9992689728788323e-315, -1.0395462025288474e-269}, v4_float = {
0.0126836384, 0, -1.46188823e-15, -4.51291888e-34}, v4_int32 = {
0x3c4fcf09, 0x0, 0xa6d2ae28, 0x8815f7ab}, v8_int16 = {0xcf09, 0x3c4f, 0x0,
0x0, 0xae28, 0xa6d2, 0xf7ab, 0x8815}, v16_int8 = {0x9, 0xcf, 0x4f, 0x3c,
0x0, 0x0, 0x0, 0x0, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88}}
Big endian machine:
(gdb) p r6
$4 = {0x0, 0x0, 0x0, 0x0, 0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6,
0xab, 0xf7, 0x15, 0x88}
Notice the odd rotation on the little endian machine.
When I disassemble on the little endian machine after line 2 executes:
(gdb) disass $pc
<skip multiple pages>
0x0000000010000dc8 <+168>: lxvd2x vs12,r31,r9
0x0000000010000dcc <+172>: xxswapd vs12,vs12
0x0000000010000dd0 <+176>: xxlor vs32,vs0,vs0
0x0000000010000dd4 <+180>: xxlor vs33,vs12,vs12
0x0000000010000dd8 <+184>: vsldoi v0,v0,v1,12
0x0000000010000ddc <+188>: xxlor vs0,vs32,vs32
0x0000000010000de0 <+192>: xxswapd vs0,vs0
0x0000000010000de4 <+196>: li r9,64
0x0000000010000de8 <+200>: stxvd2x vs0,r31,r9
=> 0x0000000010000dec <+204>: li r9,48
0x0000000010000df0 <+208>: lxvd2x vs0,r31,r9
0x0000000010000df4 <+212>: xxswapd vs34,vs0
(gdb) p $v0
$5 = void
(gdb) p $vs0
$4 = {uint128 = 0x8815f7aba6d2ae28000000003c4fcf09, v2_double = {
4.9992689728788323e-315, -1.0395462025288474e-269}, v4_float = {
0.0126836384, 0, -1.46188823e-15, -4.51291888e-34}, v4_int32 = {
0x3c4fcf09, 0x0, 0xa6d2ae28, 0x8815f7ab}, v8_int16 = {0xcf09, 0x3c4f, 0x0,
0x0, 0xae28, 0xa6d2, 0xf7ab, 0x8815}, v16_int8 = {0x9, 0xcf, 0x4f, 0x3c,
0x0, 0x0, 0x0, 0x0, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88}}
I have no idea why r6 is not the expected value. Ideally I would examine the VSX registers on both machines. Unfortunately, GDB is also problematic on both machines, so I can't reliably do things like disassemble and print vector registers.
Is vec_sld endian sensitive? Or is there something else wrong?
Big endian and little endian each have their place in computer architecture. According to Wikipedia, big endian is “the most common format in data networking”: many network protocols, such as TCP, UDP, IPv4 and IPv6, transmit data in big-endian order. Little endian is mainly used on microprocessors.
Note: Endianness does NOT affect ordering of array elements! However, endianness does affect the ordering of the bytes in each element of the array!
Bit order usually follows the same endianness as the byte order for a given computer system. That is, in a big endian system the most significant bit is stored at the lowest bit address; in a little endian system, the least significant bit is stored at the lowest bit address.
So when it comes to bit-shifting, endianness doesn't matter [as long as you are shifting the number in one unit]. It's only when reading/writing memory that endianness makes a difference - the bytes from a big number either go out "big end" or "little end" first.
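The distinction between shifting a value and storing it can be checked on any host with a few lines of C. This is a minimal sketch: the helper names (host_is_little_endian, shift_left_8, bytes_in_memory, check_shift_ignores_byte_order) are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the host stores the least significant byte first. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* A shift operates on the value, so the result is the same on any host. */
static uint32_t shift_left_8(uint32_t x)
{
    return x << 8;
}

/* Copies the bytes of x exactly as they sit in memory; this is the only
   place where the host byte order becomes visible. */
static void bytes_in_memory(uint32_t x, uint8_t out[4])
{
    memcpy(out, &x, 4);
}

/* Self-check: the shifted value is fixed, while the first byte in memory
   depends on the host byte order. */
static void check_shift_ignores_byte_order(void)
{
    uint8_t mem[4];
    assert(shift_left_8(0x00112233u) == 0x11223300u);
    bytes_in_memory(0x11223344u, mem);
    assert(mem[0] == (host_is_little_endian() ? 0x44 : 0x11));
}
```

Note that this covers scalar shifts only. A vector operation such as vec_sld moves whole elements between lanes, so the lane numbering seen by the program matters as well.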
Little endian with PowerPC/AltiVec can get a little mind-bending at times - if you need to make your code work with both big and little endian then it helps to define some portability macros, e.g. for vec_sld:
#ifdef __BIG_ENDIAN__
#define VEC_SLD(va, vb, shift) vec_sld(va, vb, shift)
#else
#define VEC_SLD(va, vb, shift) vec_sld(vb, va, 16 - (shift))
#endif
You'll probably find this helpful for all intrinsics which involve horizontal/positional operations or narrowing/widening, e.g. vec_merge, vec_pack et al, vec_unpack, vec_perm, vec_mule/vec_mulo, vec_splat, vec_lvsl/vec_lvsr, etc.
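To see why swapping the operands and using 16 - shift restores the big-endian result, it can help to model the instruction in plain C. The sketch below is a scalar model, not AltiVec code. It assumes that vec_sld maps directly to vsldoi on the register bytes and that, on little endian, element i of the vector sits in register byte 15 - i. That assumption is consistent with the GDB output in the question but is not taken from the compiler documentation. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vsldoi: concatenate a:b in register byte order and
   return the 16 bytes starting at offset `shift` (0..16 in this model). */
static void model_vsldoi(const uint8_t a[16], const uint8_t b[16],
                         unsigned shift, uint8_t out[16])
{
    uint8_t cat[32];
    memcpy(cat, a, 16);
    memcpy(cat + 16, b, 16);
    memcpy(out, cat + shift, 16);
}

/* Reverse the 16 bytes of a vector (element order <-> register order on
   little endian, under the assumption stated above). */
static void reverse16(const uint8_t in[16], uint8_t out[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)
        tmp[i] = in[15 - i];
    memcpy(out, tmp, 16);
}

/* vec_sld as seen in element order on big endian: same as vsldoi. */
static void model_vec_sld_be(const uint8_t a[16], const uint8_t b[16],
                             unsigned shift, uint8_t out[16])
{
    model_vsldoi(a, b, shift, out);
}

/* vec_sld as seen in element order on little endian: the instruction
   works on the reversed register image of each operand. */
static void model_vec_sld_le(const uint8_t a[16], const uint8_t b[16],
                             unsigned shift, uint8_t out[16])
{
    uint8_t ra[16], rb[16], rr[16];
    reverse16(a, ra);
    reverse16(b, rb);
    model_vsldoi(ra, rb, shift, rr);
    reverse16(rr, out);
}

/* Little-endian branch of the VEC_SLD macro: swapped operands, 16 - shift. */
static void model_VEC_SLD_le(const uint8_t a[16], const uint8_t b[16],
                             unsigned shift, uint8_t out[16])
{
    model_vec_sld_le(b, a, 16 - shift, out);
}

/* Returns 1 if the swapped form matches the big-endian result for every
   shift from 1 to 15. */
static int swapped_form_matches(const uint8_t a[16], const uint8_t b[16])
{
    uint8_t be[16], le[16];
    for (unsigned s = 1; s < 16; ++s) {
        model_vec_sld_be(a, b, s, be);
        model_VEC_SLD_le(a, b, s, le);
        if (memcmp(be, le, 16) != 0)
            return 0;
    }
    return 1;
}
```

Feeding the model r0 = {0} and the r1 value from the question with a shift of 12 gives the "odd rotation" from the little endian machine for model_vec_sld_le, and the big endian value for model_VEC_SLD_le. The real vec_sld requires a literal shift in the range 0 to 15, so the macro form is only meaningful for shifts from 1 to 15.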