I use various double-machine-word types in GCC, e.g. (u)int128_t on x86_64 and (u)int64_t on i386, ARM, etc. I am looking for a correct, portable, clean way to access and manipulate the individual machine words (mostly in assembler). For example, on 32-bit machines I want direct access to the high and low 32-bit parts of an int64_t, the way gcc uses them internally, without error-prone code like the example below. Similarly, for the "native" 128-bit types I want to access the 64-bit parts gcc is using (not for the example below, since "add" is simple enough, but in general).
Consider the 32-bit ASM path in the following code, which adds two int128_t values (the type may be "native" to gcc, "native" to the machine, or "half native" to the machine); it is horrendous, hard to maintain, and slower.
#include <stdint.h>
#define BITS 64
#if defined(USENATIVE)
// USE "NATIVE" 128bit GCC TYPE
typedef __int128_t int128_t;
typedef __uint128_t uint128_t;
typedef int128_t I128;
#define HIGH(x) (x)
#define HIGHVALUE(x) ((uint64_t)((x) >> BITS))
#define LOW(x) (x)
#define LOWVALUE(x) ((uint64_t)(x))
#else
typedef struct I128 {
int64_t high;
uint64_t low;
} I128;
#define HIGH(x) x.high
#define HIGHVALUE(x) x.high
#define LOW(x) x.low
#define LOWVALUE(x) x.low
#endif
#define HIGHHIGH(x) (HIGHVALUE(x) >> (BITS / 2))
#define HIGHLOW(x) (HIGHVALUE(x) & 0xFFFFFFFF)
#define LOWHIGH(x) (LOWVALUE(x) >> (BITS / 2))
#define LOWLOW(x) (LOWVALUE(x) & 0xFFFFFFFF)
inline I128 I128add(I128 a, const I128 b) {
#if defined(USENATIVE)
return a + b;
#elif defined(USEASM) && defined(X86_64)
__asm(
"ADD %[blo], %[alo]\n"
"ADC %[bhi], %[ahi]"
// destinations in registers: x86 has no memory-to-memory ADD/ADC
: [alo] "+&r" (a.low), [ahi] "+r" (a.high)
: [blo] "rm" (b.low), [bhi] "rm" (b.high)
: "cc"
);
return a;
#elif defined(USEASM) && defined(X86_32)
// SLOWER DUE TO ALL THE CRAP
int32_t ahihi = HIGHHIGH(a), bhihi = HIGHHIGH(b);
uint32_t ahilo = HIGHLOW(a), bhilo = HIGHLOW(b);
uint32_t alohi = LOWHIGH(a), blohi = LOWHIGH(b);
uint32_t alolo = LOWLOW(a), blolo = LOWLOW(b);
__asm(
"ADD %[blolo], %[alolo]\n"
"ADC %[blohi], %[alohi]\n"
"ADC %[bhilo], %[ahilo]\n"
"ADC %[bhihi], %[ahihi]\n"
: [alolo] "+r" (alolo), [alohi] "+r" (alohi), [ahilo] "+r" (ahilo), [ahihi] "+r" (ahihi)
: [blolo] "g" (blolo), [blohi] "g" (blohi), [bhilo] "g" (bhilo), [bhihi] "g" (bhihi)
: "cc"
);
a.high = ((int64_t)ahihi << (BITS / 2)) + ahilo;
a.low = ((uint64_t)alohi << (BITS / 2)) + alolo;
return a;
#else
// this seems faster than adding to a directly
I128 r = {a.high + b.high, a.low + b.low};
// check for overflow of low 64 bits, add carry to high
// avoid conditionals
r.high += r.low < a.low || r.low < b.low;
return r;
#endif
}
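For comparison, the portable branch above can also be written with GCC's overflow builtin instead of the manual `r.low < a.low` check. This is only a sketch, assuming GCC 5 or later (or Clang), where `__builtin_add_overflow` is available. The struct is repeated so the snippet compiles on its own, and the function name `I128add_builtin` is hypothetical.

```c
#include <stdint.h>

/* Same layout as the non-native I128 struct in the code above. */
typedef struct I128 {
    int64_t high;
    uint64_t low;
} I128;

/* Hypothetical name; adds two 128-bit values held as high/low halves. */
static inline I128 I128add_builtin(I128 a, I128 b) {
    I128 r;
    /* Returns true when the unsigned 64-bit sum wraps, i.e. the carry out. */
    unsigned carry = __builtin_add_overflow(a.low, b.low, &r.low);
    /* Add the high halves as unsigned so that wraparound is well defined. */
    r.high = (int64_t)((uint64_t)a.high + (uint64_t)b.high + carry);
    return r;
}
```

On x86 the compiler will typically turn this into an ADD/ADC pair on its own, though that depends on the compiler version and optimization level.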
Please note that I don't use C/ASM much; in fact, this is my first attempt at inline ASM. Being used to Java/C#/JS/PHP etc. means that something very obvious to a routine C dev may not be apparent to me (besides the obvious insecure quirkiness in code style ;)). All this may also be called something else entirely, because I had a very hard time finding anything online on the subject (I am a non-native speaker as well).
Thanks a lot!
After much digging I have found the following theoretical solution. It works, but it is unnecessarily slow (slower than the much longer gcc output!) because it forces everything to memory, and I am looking for a generic solution (reg/mem/possibly imm). I have also found that if you use an "r" constraint on e.g. a 64-bit int on a 32-bit machine, gcc will actually put the value in two registers (e.g. eax and ebx). The problem is not being able to reliably access the second part. I am sure there is some hidden operand modifier, just hard to find, that tells gcc I want to access that second part.
uint32_t t1, t2;
__asm(
"MOV %[blo], %[t1]\n"
"MOV 4+%[blo], %[t2]\n"
"ADD %[t1], %[alo]\n"
"ADC %[t2], 4+%[alo]\n"
"MOV %[bhi], %[t1]\n"
"MOV 4+%[bhi], %[t2]\n"
"ADC %[t1], %[ahi]\n"
"ADC %[t2], 4+%[ahi]\n"
: [alo] "+o" (a.low), [ahi] "+o" (a.high), [t1] "=&r" (t1), [t2] "=&r" (t2)
: [blo] "o" (b.low), [bhi] "o" (b.high)
: "cc"
);
return a;
I have heard that this site isn't really for "code review", but since this is an interesting aspect to me, I think I can offer some advice:
In the 32-bit version, you could do the HIGHHIGH and co. with some clever overlaying of arrays of int/uint, instead of shifting and anding. Using a union is one way to do this, the other being pointer "magic". Since the assembler part of the code isn't particularly portable in the first place, using non-portable code in the form of type-punning casts or non-portable unions isn't really that big a deal.
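A minimal sketch of the union approach, assuming GCC (which documents reading a union member other than the one last written) and using the predefined `__BYTE_ORDER__` macro to pick the word order. The names `U64Words`, `low_word`, `high_word` and `add_by_words` are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value overlaid with its two 32-bit words. */
typedef union {
    uint64_t whole;
    uint32_t word[2];
} U64Words;

/* Index of the low word: 0 on little-endian (x86, most ARM), 1 on big-endian. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
enum { LO = 1, HI = 0 };
#else
enum { LO = 0, HI = 1 };
#endif

static inline uint32_t low_word(uint64_t x) {
    U64Words u;
    u.whole = x;
    return u.word[LO];
}

static inline uint32_t high_word(uint64_t x) {
    U64Words u;
    u.whole = x;
    return u.word[HI];
}

/* Adds two 64-bit values one 32-bit word at a time, carrying by hand. */
static inline uint64_t add_by_words(uint64_t a, uint64_t b) {
    U64Words ua, ub;
    ua.whole = a;
    ub.whole = b;
    uint32_t lo = ua.word[LO] + ub.word[LO];   /* wraps modulo 2^32 */
    uint32_t carry = lo < ua.word[LO];         /* 1 if the low add wrapped */
    ua.word[HI] = ua.word[HI] + ub.word[HI] + carry;
    ua.word[LO] = lo;
    return ua.whole;
}
```

The same overlay would let `ua.word[LO]` and `ua.word[HI]` be passed as separate 32-bit operands to an inline asm block, replacing the shift-and-mask macros.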
Edit: relying on position within the words may also be fine. So, for example, just passing in the addresses of the input and the output in registers, and then addressing the remaining parts at offset 4, i.e. (%0+4) and (%1+4), written 4(%0) and 4(%1) in AT&T syntax, is definitely an option.
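A rough sketch of the address-plus-offset idea for a 64-bit add on 32-bit x86, assuming GCC extended asm in AT&T syntax and a little-endian layout (low word at offset 0, high word at offset 4). The function name `add64_at` is hypothetical, and on any other target the snippet simply lets the compiler do the addition.

```c
#include <stdint.h>

/* Hypothetical helper: *a += *b, given only the two addresses. */
static inline void add64_at(uint64_t *a, const uint64_t *b) {
#if defined(__i386__) && defined(__GNUC__)
    uint32_t t;  /* scratch register for the word being added */
    __asm__ (
        "movl   (%[b]), %[t]\n\t"    /* low word of b */
        "addl   %[t], (%[a])\n\t"    /* add into low word of a, sets carry */
        "movl   4(%[b]), %[t]\n\t"   /* high word of b; MOV leaves flags alone */
        "adcl   %[t], 4(%[a])"       /* add with carry into high word of a */
        : [t] "=&r" (t)              /* early clobber: must not alias a or b */
        : [a] "r" (a), [b] "r" (b)
        : "cc", "memory");           /* memory is accessed through the pointers */
#else
    /* Not 32-bit x86: plain C addition. */
    *a += *b;
#endif
}
```

This does force both operands into memory, so it trades the shift-and-mask overhead for loads and stores.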
It will of course get more interesting if you have to do this for multiply and divide... I'm not sure I'd like to go there...