Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

P6 Architecture - Register renaming aside, does the limited user registers result in more ops spent spilling/loading?

I'm studying JIT design with regard to dynamic languages VM implementation. I haven't done much Assembly since the 8086/8088 days, just a little here or there, so be nice if I'm out of sorts.

As I understand it, the x86 (IA-32) architecture still has the same basic limited register set today that it always did, but the internal register count has grown tremendously, but these internal registers are not generally available and are used with register renaming to achieve parallel pipelining of code that otherwise could not be parallelizable. I understand this optimization pretty well, but my feeling is, while these optimizations help in overall throughput and for parallel algorithms, the limited register set we are still stuck with results in more register spilling overhead such that if x86 had double, or quadruple the registers available to us, there may be significantly less push/pop opcodes in a typical instruction stream? Or are there other processor optmizations that also optimize this away that I am unaware of? Basically if I've a unit of code that has 4 registers to work with for integer work, but my unit has a dozen variables, I've got potentially a push/pop for every 2 or so instructions.

Any references to studies, or better yet, personal experiences?

EDIT: x86_64 has 16 registers, which is double x86-32, thanks for the correction and info.

like image 809
codenheim Avatar asked Mar 17 '10 06:03

codenheim


2 Answers

In addition to renaming registers to hide bubbles due to instruction latencies, most x86 architectures are smart enough to count pushes and pops and rename those onto registers as well. Remember that the instruction decoder on the x86 actually performs a sort of JIT compilation, turning the x86 instruction stream into a small microcode program stored in the trace cache. Part of this process includes intercepting small-offset stack loads and turning those into registers as well. Thus something like (the patently silly and purely for example):

lwz eax,[ebp]
lwz ebx,[ebp+4]
add eax,[edx+0]
push eax 
lwz eax,[ebp+8]
add eax,ebx
pop ebx
add eax,ebx

cooks into something like (pretend internal registers are named eg r0..r16):

lw r3, edx
lw r1, ebp
lw r2, ebp+4 ; the constant '4' is usually stored as an immediate operand
add r1,r2
or r4,r1,r1 ;; move r1 to r4
lw r1, ebp+8
add r1,r2
or r2,r4,r4
add r1,r2

Of course a magically smart decoder (unlike the one that actually fits into transistor count) would collapse some of the unnecessary moves there, but the point I am making is that push/pop and stores/loads to esp+(some small number) are actually turned into shadow registers.

like image 156
Crashworks Avatar answered Nov 15 '22 08:11

Crashworks


Two points:

(1) x86-64 doubles the number of registers to 16

(2) in modern x86 CPUs, an instruction that uses a memory location that is already in L1 cache is nearly as fast as the same operation with a register operand, so you can almost think of L1 as "register memory"

like image 39
Paul R Avatar answered Nov 15 '22 08:11

Paul R