I have many truth-tables of many variables (7 or more) and I use a tool (eg logic friday 1) to simplify the logic formula. I could do that by hand but that is much too error prone. These formula I then translate to compiler intrinsics (eg _mm_xor_epi32) which works fine.
Question: with vpternlog I can make ternary logic operations. But I'm not aware of a method to simplify my truth-tables to sequences of vpternlog instructions that are (somewhat) efficient.
I'm not asking if someone knows a tool that simplifies to arbitrary ternary logic operations, although that would be great, I'm looking for a method to do such simplifications.
Edit: I asked a similar question on Electrical Engineering.
Outside of just leaving it to the compiler, or the hand-wavy suggestions in the 2nd section of my answer, see HJLebbink's self-answer using FPGA logic-optimization tools. (This answer ended up with the bounty because it failed to attract such an answer from anyone else; it's not really bounty-worthy. :/ I wrote it before there was a bounty, but don't have anything else useful to add.)
ICC18 optimizes chained _mm512_and/or/xor_epi32
intrinsics into vpternlogd
instructions, but gcc/clang don't.
On Godbolt for this and a more complicated function using some inputs multiple times:
#include <immintrin.h>
__m512i logic(__m512i a, __m512i b, __m512i c,
__m512i d, __m512i e, __m512i f, __m512i g) {
// return _mm512_and_epi32(_mm512_and_epi32(a, b), c);
return a & b & c & d & e & f;
}
gcc -O3 -march=skylake-avx512
nightly build
logic:
vpandq zmm4, zmm4, zmm5
vpandq zmm3, zmm2, zmm3
vpandq zmm4, zmm4, zmm3
vpandq zmm0, zmm0, zmm1
vpandq zmm0, zmm4, zmm0
ret
ICC18 -O3 -march=skylake-avx512
logic:
vpternlogd zmm2, zmm0, zmm1, 128 #6.21
vpternlogd zmm4, zmm2, zmm3, 128 #6.29
vpandd zmm0, zmm4, zmm5 #6.33
ret #6.33
IDK how good it is at picking optimal solutions when each variable is used more than once in different subexpressions.
To see if it does a good job, you have to do the optimization yourself. You want to find sets of 3 variables that can be combined together into a single boolean value without still needing those 3 variables anywhere else in the expression.
I think it's possible for a truth table with more than 3 inputs to not simplify down this way, to a smaller truth table where one of the columns is the result of a ternary combination of 3 of the inputs. e.g. I think it's not guaranteed that it's possible to simplify a 4 input function to vpternlog + AND, OR, or XOR.
I'd definitely worry that compilers might pick 3 inputs to combine that didn't result in as much simplification as a different choice of 3.
It might even be optimal for a compiler to start with a binary operation or two on a couple pairs to set up for a ternary operation, especially if that enables better ILP.
You could probably write a brute-force truth-table optimizer that looked for triplets of variables that could be combined to make a smaller table for just the ternary result and the rest of the table. But I'm not sure a greedy approach is guaranteed to give the best results. If there are multiple ways to combine with the same total instruction count, they're probably not all equivalent for ILP (Instruction Level Parallelism).
How to translate a truth table to a sequence of vpternlog
instructions.
Content of BF_Q6.eqn:
INORDER = A B C D E F;
OUTORDER = F0 F1;
F0 = (!A*!B*!C*!D*!E*F) + (!A*!B*!C*!D*E*!F) + (!A*!B*!C*D*!E*!F) + (!A*!B*C*!D*!E*!F) + (!A*B*!C*!D*!E*!F) + (A*!B*!C*!D*!E*!F);
F1 = (!A*!B*!C*!D*E) + (!A*!B*!C*D*!E) + (!A*!B*C*!D*!E) + (!A*B*!C*!D*!E) + (A*!B*!C*!D*!E);
In ABC I run:
abc 01> read_eqn BF_Q6.eqn
abc 02> choice; if -K 3; ps
abc 03> lutpack -N 3 -S 3; ps
abc 04> show
abc 05> write_bench BF_Q6.bench
You may need to run choice; if -K 3; ps
multiple time to get better results.
The resulting BF_Q6.bench contains the 3-LUTs for an FPGA:
INPUT(A)
INPUT(B)
INPUT(C)
INPUT(D)
INPUT(E)
INPUT(F)
OUTPUT(F0)
OUTPUT(F1)
n11 = LUT 0x01 ( B, C, D )
n12 = LUT 0x1 ( A, E )
n14 = LUT 0x9 ( A, E )
n16 = LUT 0xe9 ( B, C, D )
n18 = LUT 0x2 ( n11, n14 )
F1 = LUT 0xae ( n18, n12, n16 )
n21 = LUT 0xd9 ( F, n11, n14 )
n22 = LUT 0xd9 ( F, n12, n16 )
F0 = LUT 0x95 ( F, n21, n22 )
4. This can be translated to C++ mechanically:
__m512i not(__m512i a) { return _mm512_xor_si512(a, _mm512_set1_epi32(-1)); }
__m512i binary(__m512i a, __m512i b, int i) {
switch (i)
{
case 0: return _mm512_setzero_si512();
case 1: return not(_mm512_or_si512(a, b));
case 2: return _mm512_andnot_si512(a, b);
case 3: return not(a);
case 4: return _mm512_andnot_si512(b, a);
case 5: return not(b);
case 6: return _mm512_xor_si512(a, b);
case 7: return not(_mm512_and_si512(a, b));
case 8: return _mm512_and_si512(a, b);
case 9: return not(_mm512_xor_si512(a, b));
case 10: return b;
case 11: return _mm512_or_si512(not(a), b);
case 12: return a;
case 13: return mm512_or_si512(a, not(b));
case 14: return _mm512_or_si512(a, b);
case 15: return _mm512_set1_epi32(-1);
default: return _mm512_setzero_si512();
}
}
void BF_Q6(const __m512i A, const __m512i B, const __m512i C, const __m512i D, const __m512i E, const __m512i F, __m512i& F0, __m512i& F1) {
const auto n11 = _mm512_ternarylogic_epi64(B, C, D, 0x01);
const auto n12 = binary(A, E, 0x1);
const auto n14 = binary(A, E, 0x9);
const auto n16 = _mm512_ternarylogic_epi64(B, C, D, 0xe9);
const auto n18 = binary(n11, n14, 0x2);
F1 = _mm512_ternarylogic_epi64(n18, n12, n16, 0xae);
const auto n21 = _mm512_ternarylogic_epi64(F, n11, n14, 0xd9);
const auto n22 = _mm512_ternarylogic_epi64(F, n12, n16, 0xd9);
F0 = _mm512_ternarylogic_epi64(F, n21, n22, 0x95);
}
The question remains whether the resulting C++ code is optimal. I don't think that this method (often) yields the smallest networks of 3-LUTs, simply because this problem is NP-hard. Additionally, it is not possible to inform ABC about instruction parallelism, and it is not possible to prioritize the order of variables such that variables that will be used later are not in the first position of the LUT (since the first source operand is overwritten with the result). But the compiler may be smart enough to do such optimizations.
ICC18 gives the following assembly:
00007FF75DCE1340 sub rsp,78h
00007FF75DCE1344 vmovups xmmword ptr [rsp+40h],xmm15
00007FF75DCE134A vmovups zmm2,zmmword ptr [r9]
00007FF75DCE1350 vmovups zmm1,zmmword ptr [r8]
00007FF75DCE1356 vmovups zmm5,zmmword ptr [rdx]
00007FF75DCE135C vmovups zmm4,zmmword ptr [rcx]
00007FF75DCE1362 vpternlogd zmm15, zmm15, zmm15, 0FFh
00007FF75DCE1369 vpternlogq zmm5, zmm1, zmm2, 0E9h
00007FF75DCE1370 vmovaps zmm3, zmm2
00007FF75DCE1376 mov rax, qword ptr[&E]
00007FF75DCE137E vpternlogq zmm3, zmm1, zmmword ptr[rdx], 1
00007FF75DCE1385 mov r11, qword ptr[&F]
00007FF75DCE138D mov rcx, qword ptr[F0]
00007FF75DCE1395 mov r10, qword ptr[F1]
00007FF75DCE139D vpord zmm0, zmm4, zmmword ptr[rax]
00007FF75DCE13A3 vpxord zmm4, zmm4, zmmword ptr[rax]
00007FF75DCE13A9 vpxord zmm0, zmm0, zmm15
00007FF75DCE13AF vpxord zmm15, zmm4, zmm15
00007FF75DCE13B5 vpandnd zmm1, zmm3, zmm15
00007FF75DCE13BB vpternlogq zmm1, zmm0, zmm5, 0AEh
00007FF75DCE13C2 vpternlogq zmm15, zmm3, zmmword ptr[r11], 0CBh
^^^^^ ^^^^^^^^^^^^^^^^
00007FF75DCE13C9 vpternlogq zmm5, zmm0, zmmword ptr[r11], 0CBh
00007FF75DCE13D0 vmovups zmmword ptr[r10], zmm1
00007FF75DCE13D6 vpternlogq zmm5, zmm15, zmmword ptr[r11], 87h
00007FF75DCE13DD vmovups zmmword ptr [rcx],zmm5
00007FF75DCE13E3 vzeroupper
00007FF75DCE13E6 vmovups xmm15,xmmword ptr [rsp+40h]
00007FF75DCE13EC add rsp,78h
00007FF75DCE13F0 ret
ICC18 is able to change the variable ordering in const auto n22 = _mm512_ternarylogic_epi64(F, n12, n16, 0xd9);
to vpternlogq zmm15, zmm3, zmmword ptr[r11], 0CBh
such that variable F is not overwritten. (But strangely enough fetched from memory 3 times...)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With