I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently.
Equality (=)
For equality, I already got a solution:
bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
bool result;
First, load the two values into SIMD registers.
__asm__("ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
Compare each 64-bit limb of the values with each other. This results in -1 (all bits set) for those limbs that are equal, and 0 (all bits clear) if a bit differs.
"cmeq.2d v0, v0, v2 \n\t"
"cmeq.2d v1, v1, v3 \n\t"
Reduce the result from two vectors to one: the pairwise unsigned minimum keeps a zero byte (all bits clear) whenever either input vector contains one.
"uminp.16b v0, v0, v1 \n\t"
Reduce the result from one vector to a single byte: the unsigned minimum across all 16 bytes is zero if any byte is zero, i.e. if any limb differed.
"uminv.16b b0, v0 \n\t"
Move the byte to an ARM general-purpose register, then compare it with 0xFF. This is the result.
"umov %w0, v0.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq "
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "cc");
return result;
}
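For comparison, here is a rough equivalent written with NEON intrinsics instead of inline asm. This is only a sketch: it assumes UInt256 is a struct holding uint64_t value[4] (the definition is not shown in this post), and the compiler will not necessarily reproduce the exact uminp/uminv sequence.

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value[4]; } UInt256;   // assumed layout

bool eq256_intrinsics(const UInt256 *lhs, const UInt256 *rhs) {
    uint64x2_t l01 = vld1q_u64(&lhs->value[0]);
    uint64x2_t l23 = vld1q_u64(&lhs->value[2]);
    uint64x2_t r01 = vld1q_u64(&rhs->value[0]);
    uint64x2_t r23 = vld1q_u64(&rhs->value[2]);

    // Per-limb equality masks: all bits set where the limbs match.
    uint64x2_t eq01 = vceqq_u64(l01, r01);
    uint64x2_t eq23 = vceqq_u64(l23, r23);

    // ANDing the masks has the same effect here as the byte-wise minimum:
    // any differing limb leaves zero bytes in the combined mask.
    uint64x2_t both = vandq_u64(eq01, eq23);

    // Horizontal minimum over all 16 bytes is 0xFF only if every limb matched.
    return vminvq_u8(vreinterpretq_u8_u64(both)) == 0xFF;
}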
Questions
Is this more efficient than doing the 4 comparisons with plain old ARM registers?
Is there a way to optimize this even further? I think that I waste many cycles just to reduce the whole vector to a single scalar boolean.
Less Than comparison (<)
Let's represent the two ints as tuples of 64-bit limbs (little-endian): lhs = (l0, l1, l2, l3) and rhs = (r0, r1, r2, r3), with l3 and r3 being the most significant limbs.
Then lhs < rhs if the following evaluates to true:
(l3 < r3) & 1 & 1 & 1 |
(l3 = r3) & (l2 < r2) & 1 & 1 |
(l3 = r3) & (l2 = r2) & (l1 < r1) & 1 |
(l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
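For reference, the same condition expressed in plain C (a sketch only; it assumes UInt256 holds uint64_t value[4] with value[3] as the most significant limb):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value[4]; } UInt256;   // assumed layout

bool lt256_ref(const UInt256 *lhs, const UInt256 *rhs) {
    const uint64_t *l = lhs->value, *r = rhs->value;
    return  (l[3] <  r[3]) ||
           ((l[3] == r[3]) && (l[2] <  r[2])) ||
           ((l[3] == r[3]) && (l[2] == r[2]) && (l[1] <  r[1])) ||
           ((l[3] == r[3]) && (l[2] == r[2]) && (l[1] == r[1]) && (l[0] < r[0]));
}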
SIMD instructions can now be used to evaluate multiple operands at a time. Assuming the two 256-bit numbers are stored as the vector pairs (l0, l1), (l2, l3) and (r0, r1), (r2, r3), all of the comparison results required above can be obtained with just four vector comparisons (two cmeq.2d and two cmlo.2d).
Questions
Update
I just put together a working implementation for "less than".
Basically, I replaced the 1s above with a duplicated condition, because A & A == A & 1.
Then, I lay out the three 2x2 squares in my matrix, and bitwise AND them. Now, I reduce with bitwise ORs - first from two vectors to one vector, then to one byte, then copy to ARM register, and test for 0xFF. Same pattern as for equality above.
The question above is still valid. I'm not sure whether the code is optimal yet, and I wonder if I missed some general SIMD pattern for doing this kind of thing more efficiently. Also: is NEON worth it for such cases when the input operands come from memory?
bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
bool result;
__asm__(// (l3 < r3) & (l3 < r3) |
// (l3 = r3) & (l2 < r2) |
// (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
// (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
"ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
// v0: [ l3 = r3 ] [ l2 = r2 ]
// v1: [ l0 < r0 ] [ l1 < r1 ]
// v2: [ l0 = r0 ] [ l1 = r1 ]
// v3: [ l2 < r2 ] [ l3 < r3 ]
// v4: [ l2 = r2 ] [ l3 = r3 ]
"cmeq.2d v4, v1, v3 \n\t"
"cmlo.2d v3, v1, v3 \n\t"
"cmlo.2d v1, v0, v2 \n\t"
"cmeq.2d v2, v0, v2 \n\t"
"ext.16b v0, v4, v4, 8 \n\t"
// v2: [ l1 < r1 ] [ l1 = r1 ]
// v1: [ l1 < r1 ] [ l0 < r0 ]
"trn2.2d v2, v1, v2 \n\t"
"ext.16b v1, v1, v1, 8 \n\t"
// v1: [ l1 < r1 & l1 < r1 ] [ l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v2: [ l3 < r3 ] [ l3 = r3 ]
// v3: [ l3 < r3 ] [ l2 < r2 ]
"ext.16b v2, v3, v0, 8 \n\t"
"ext.16b v3, v3, v3, 8 \n\t"
// v3: [ l3 < r3 & l3 < r3 ] [ l3 = r3 & l2 < r2 ]
"and.16b v3, v2, v3 \n\t"
// v2: [ l3 = r3 ] [ l3 = r3 ]
// v4: [ l2 = r2 ] [ l2 = r2 ]
"ext.16b v2, v4, v0, 8 \n\t"
"ext.16b v4, v0, v4, 8 \n\t"
// v2: [ l3 = r3 & l2 = r2 ] [ l3 = r3 & l2 = r2 ]
"and.16b v2, v2, v4 \n\t"
// v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
// [ l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v1: [ l3 < r3 & l3 < r3 |
// l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
// [ l3 = r3 & l2 < r2 |
// l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"orr.16b v1, v3, v1 \n\t"
// b1: [ l3 < r3 & l3 < r3 |
// l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
// l3 = r3 & l2 < r2 |
// l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"umaxv.16b b1, v1 \n\t"
"umov %w0, v1.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq"
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "v4", "cc");
return result;
}
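The same dataflow can also be expressed with NEON intrinsics, which leaves the instruction scheduling to the compiler. The following is a sketch rather than a transcription of the asm above: it builds the four terms of the formula directly, and again assumes UInt256 holds uint64_t value[4].

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value[4]; } UInt256;   // assumed layout

bool lt256_intrinsics(const UInt256 *lhs, const UInt256 *rhs) {
    uint64x2_t l01 = vld1q_u64(&lhs->value[0]);      // [l0, l1]
    uint64x2_t l23 = vld1q_u64(&lhs->value[2]);      // [l2, l3]
    uint64x2_t r01 = vld1q_u64(&rhs->value[0]);
    uint64x2_t r23 = vld1q_u64(&rhs->value[2]);

    // Four comparisons give all required per-limb predicates
    // (vceqq_u64 corresponds to cmeq.2d, vcltq_u64 to cmlo.2d).
    uint64x2_t eq01 = vceqq_u64(l01, r01);           // [l0 = r0, l1 = r1]
    uint64x2_t eq23 = vceqq_u64(l23, r23);           // [l2 = r2, l3 = r3]
    uint64x2_t lt01 = vcltq_u64(l01, r01);           // [l0 < r0, l1 < r1]
    uint64x2_t lt23 = vcltq_u64(l23, r23);           // [l2 < r2, l3 < r3]

    // Upper terms: [l3 < r3, (l3 = r3) & (l2 < r2)]
    uint64x2_t hi_a  = vcombine_u64(vget_high_u64(lt23), vget_high_u64(eq23)); // [l3<r3, l3=r3]
    uint64x2_t hi_b  = vextq_u64(lt23, lt23, 1);                               // [l3<r3, l2<r2]
    uint64x2_t upper = vandq_u64(hi_a, hi_b);

    // Lower terms without the common prefix: [l1 < r1, (l1 = r1) & (l0 < r0)]
    uint64x2_t lo_a  = vcombine_u64(vget_high_u64(lt01), vget_high_u64(eq01)); // [l1<r1, l1=r1]
    uint64x2_t lo_b  = vextq_u64(lt01, lt01, 1);                               // [l1<r1, l0<r0]
    uint64x2_t lower = vandq_u64(lo_a, lo_b);

    // Common prefix (l3 = r3) & (l2 = r2), broadcast to both lanes.
    uint64x2_t prefix = vandq_u64(vdupq_laneq_u64(eq23, 1), vdupq_laneq_u64(eq23, 0));
    lower = vandq_u64(prefix, lower);

    // OR everything together; any non-zero lane means lhs < rhs.
    uint64x2_t any = vorrq_u64(upper, lower);
    return (vgetq_lane_u64(any, 0) | vgetq_lane_u64(any, 1)) != 0;
}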
Benchmarked with XCTest measureMetrics using a Swift-based test runner. Two 256-bit ints are allocated. Then an operation is repeated 100 million times on the same two ints, measurement is stopped, and each limb of the two ints is assigned a new random value with arc4random. A second run is performed with Instruments attached, and the CPU time distribution is noted for each instruction as a comment right next to it.
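The harness itself is not shown here; the following is only a rough stand-alone C approximation of that methodology (one measured run of 100 million iterations, then re-randomizing the limbs), not the actual Swift/XCTest code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct { uint64_t value[4]; } UInt256;   // assumed layout

bool eq256(const UInt256 *lhs, const UInt256 *rhs);  // one of the variants above

static void randomize(UInt256 *x) {
    for (int i = 0; i < 4; i++)
        x->value[i] = ((uint64_t)arc4random() << 32) | arc4random();
}

int main(void) {
    UInt256 a, b;
    randomize(&a);
    randomize(&b);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    // Repeat the operation on the same two operands, as in the measured runs.
    volatile bool sink = false;
    for (int i = 0; i < 100000000; i++)
        sink = sink ^ eq256(&a, &b);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("eq256: %.3f s (sink=%d)\n", secs, (int)sink);

    // New random limbs would be assigned here before the next measured run.
    randomize(&a);
    randomize(&b);
    return 0;
}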
Equality (==)
For equality, SIMD seems to lose out when the result has to be transferred from the SIMD registers back to an ARM general-purpose register. SIMD is probably only worth it when the result is used in further SIMD calculations, or for ints longer than 256 bits (ld1 seems to be faster than ldp).
SIMD
bool result;
__asm__("ld1.2d { v0, v1 }, %1 \n\t" // 5.1%
"ld1.2d { v2, v3 }, %2 \n\t" // 26.4%
"cmeq.2d v0, v0, v2 \n\t"
"cmeq.2d v1, v1, v3 \n\t"
"uminp.16b v0, v0, v1 \n\t" // 4.0%
"uminv.16b b0, v0 \n\t" // 26.7%
"umov %w0, v0.b[0] \n\t" // 32.9%
"cmp %w0, 0xFF \n\t" // 0.0%
"cset %w0, eq "
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "cc");
return result; // 4.9% ("ret")
measured [Time, seconds] average: 11.558, relative standard deviation: 0.065%, values: [11.572626, 11.560558, 11.549322, 11.568718, 11.558530, 11.550490, 11.557086, 11.551803, 11.557529, 11.549782]
Standard
The winner here. The ccmp instruction really shines here :-)
It is clear, though, that the problem is memory bound.
bool result;
__asm__("ldp x8, x9, %1 \n\t" // 33.4%
"ldp x10, x11, %2 \n\t"
"cmp x8, x10 \n\t"
"ccmp x9, x11, 0, eq \n\t"
"ldp x8, x9, %1, 16 \n\t" // 34.1%
"ldp x10, x11, %2, 16 \n\t"
"ccmp x8, x10, 0, eq \n\t" // 32.6%
"ccmp x9, x11, 0, eq \n\t"
"cset %w0, eq \n\t"
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "x8", "x9", "x10", "x11", "cc");
return result;
measured [Time, seconds] average: 11.146, relative standard deviation: 0.034%, values: [11.149754, 11.142854, 11.146840, 11.149392, 11.141254, 11.148708, 11.142293, 11.150491, 11.139593, 11.145873]
C
LLVM fails to detect that ccmp would be a good instruction to use here, and the resulting code is slower than the asm version above.
return
lhs->value[0] == rhs->value[0] &
lhs->value[1] == rhs->value[1] &
lhs->value[2] == rhs->value[2] &
lhs->value[3] == rhs->value[3];
Compiled to
ldp x8, x9, [x0] // 24.1%
ldp x10, x11, [x1] // 0.1%
cmp x8, x10 // 0.4%
cset w8, eq // 1.0%
cmp x9, x11 // 23.7%
cset w9, eq
and w8, w8, w9 // 0.1%
ldp x9, x10, [x0, #16]
ldp x11, x12, [x1, #16] // 24.8%
cmp x9, x11
cset w9, eq // 0.2%
and w8, w8, w9
cmp x10, x12 // 0.3%
cset w9, eq // 25.2%
and w0, w8, w9
ret // 0.1%
measured [Time, seconds] average: 11.531, relative standard deviation: 0.040%, values: [11.525511, 11.529820, 11.541940, 11.531776, 11.533287, 11.526628, 11.531392, 11.526037, 11.531784, 11.533786]
Less Than (<)
(to be determined - will update later)