This is not homework, just something I though of. So, straight computing factorial is not exactly fast; memoization can help, but if the result is to fit into 32 or 64 bits, then the factorial only can work for inputs 0
through 12
and 20
respectively. So ... we might as well use a lookup table:
n n!
0 1
1 1
2 2
3 6
4 24
5 120
6 720
7 5040
8 40320
9 362880
10 3628800
11 39916800
12 479001600
13 6227020800 2^32= 4294967296
14 87178291200
15 1.30767E+12
16 2.09228E+13
17 3.55687E+14
18 6.40237E+15
19 1.21645E+17
20 2.4329E+18
2^64= 1.84467E+19
So, suppose I want to have an inline C++ factorial function which uses inline assembly, with a 32 bit or 64 bit unsigned integer expected as a result. If the input is either negative or large enough to cause overflow, the output should be 0. How can this be done in assembly such that it consumes the least amount of cycles? This code will run on a 64-bit Intel/AMD architecture. If feasible, I am interested in improving the worst case scenario, so 20!
should not take a lot longer to compute than 0!
- hopefully there is a binary search approach. Hopefully there is a clever trick for doing if (n == 0 || n == 1) { return 1; }
. Also, if the output needs to be 32 bit, then I think assembly instructions can contain both code and data in them. My assembly knowledge is weak. Please let me know if the question does not make a whole lot of sense.
Being able to use the function in C++ would be nice - makes it a more realistic problem. If, for instance, calling a function is expensive, then trying to save 1-2 clock cycles in the body of the assembly will not help much.
I have cleverly built a lookup table in assembly. Just in case you're rusty on your assembly,
The routine expects the parameter to be in the ecx
register. I verify that it is in range then read the lookup table's values into the eax
and edx
registers. If the value is out of range, I just xor the eax
and edx
registers with themselves (this forces them to 0). Unfortunately, since it's an assembly routine the compiler will be unable to inline the code. But, I'm sure the few cycles I saved by writing my awesome assembly routine will more than make up for any gain by inlining.
factorial:
xorl %eax, %eax
xorl %edx, %edx
cmpl $20, %ecx
ja .TOOBIG
movl CSWTCH.1(,%ecx,8), %eax
movl CSWTCH.1+4(,%ecx,8), %edx
.TOOBIG:
LOOKUP_TABLE:
.section .rodata
.align 32
.type CSWTCH.1, @object
.size CSWTCH.1, 168
CSWTCH.1:
.long 1
.long 0
.long 1
.long 0
.long 2
.long 0
.long 6
.long 0
.long 24
.long 0
.long 120
.long 0
.long 720
.long 0
.long 5040
.long 0
.long 40320
.long 0
.long 362880
.long 0
.long 3628800
.long 0
.long 39916800
.long 0
.long 479001600
.long 0
.long 1932053504
.long 1
.long 1278945280
.long 20
.long 2004310016
.long 304
.long 2004189184
.long 4871
.long -288522240
.long 82814
.long -898433024
.long 1490668
.long 109641728
.long 28322707
.long -2102132736
.long 566454140
The lookup table is difficult to maintain, so I've included the script I used to build it
static constexpr uint64_t const_factorial(uint32_t i) {
return (i==0)? 1: (i * const_factorial(i-1));
}
uint64_t factorial(uint32_t i) {
switch(i) {
case 0: return const_factorial(0);
case 1: return const_factorial(1);
case 2: return const_factorial(2);
case 3: return const_factorial(3);
case 4: return const_factorial(4);
case 5: return const_factorial(5);
case 6: return const_factorial(6);
case 7: return const_factorial(7);
case 8: return const_factorial(8);
case 9: return const_factorial(9);
case 10: return const_factorial(10);
case 11: return const_factorial(11);
case 12: return const_factorial(12);
case 13: return const_factorial(13);
case 14: return const_factorial(14);
case 15: return const_factorial(15);
case 16: return const_factorial(16);
case 17: return const_factorial(17);
case 18: return const_factorial(18);
case 19: return const_factorial(19);
case 20: return const_factorial(20);
default: return 0;
}
}
Just in case you missed it in my poor attempt at humor. The C++ compiler is more than capable at correctly optimizing your code. As you can see I didn't need to do anything fancy with lookup tables, binary search trees, or hashes. Just a simple switch
statement and the compiler did the rest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With