I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp
and longjmp
? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can't rewind to "later" on the stack. Only come back earlier.
jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own). Somewhat like if they were longjmp
ing to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp
returns to setjmp
.
This is how I thought it:
A
creates and zeroes a parallel stack (allocates memory and all that).A
pushes all its registers to the current stack.A
sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.A
zeroes most of its registers and sets the instruction pointer to the beginning of function B
.That's for the initialization. Now, the following situation will indefinitely loop:
B
works on that stack, does whatever work it needs to.B
comes to a point where it needs to interrupt and give A
control again.B
pushes all of its registers to its stack, takes the mysterious data structure A
gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A
told it to. In the process, it hands back A
a new, modified data structure that tells where to resume B
.A
wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B
control again.All this sounds good to me. However, there is a number of things I'm not exactly at ease with.
pusha
instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha
does push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all the push
es myself? Moreover, there is no push
for SSE registers (though there's bound to be an equivalent—I'm new to this whole "x86 assembler" thing).mov rip, rax
(Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip
(Intel syntax again), will rip
be positioned on the mov
instruction, to the instruction after it, or somewhere between?jmp foo
. Dummy.pthread
for threads?Thanks for reading my question textwall.
You are correct in that PUSHA
wont work on x64 it will raise the exception #UD
, as PUSHA
only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.
Setting RIP
is simple, jmp rax
will set RIP
to RAX
. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:
a:
call b
b:
pop rax
RAX
will now be b
. This works because CALL
pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield
, it can just intercept the caller address :)
Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function A
? That's probably not necessary.
Here's how I would approach the entire thing, trying to make it as simple as possible:
Create a structure coroutine_state
that holds the following:
initarg
arg
registers
(also contains the flags)caller_registers
Create a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);
where coro_func
is a pointer to the coroutine function body.
This function does the following:
coroutine_state
structure cs
initarg
to cs.initarg
, these will be the initial argument to the coroutinecoro_func
to cs.registers.rip
cs.registers
(not registers, only flags, as we need some sane flags to prevent an apocalypse)cs.registers.rsp
coroutine_state
structureNow we have another function:
void* coroutine_next(coroutine_state cs, void* arg)
where cs
is the structure returned from coroutine_init
which represents a coroutine instance, and arg
will be fed into the coroutine as it resumes execution.
This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
cs.caller_registers
except for RSP
, see step 3.arg
in cs.arg
cs.caller_registers.rsp
), adding 2*sizeof(void*)
will fix it if you're lucky, you'd have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling itmov rax, [rsp]
, assign RAX
to cs.caller_registers.rip
; explanation: unless your compiler is on crack, [RSP]
will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)cs.registers
jmp cs.registers.rip
, efectively resuming execution of the coroutineNote that we never return from this function, the coroutine we jump to "returns" for us (see coroutine_yield
). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc's -fomit-frame_pointer will remove the epilogue stuff.
The last function is declared as:
void coroutine_yield(void* ret);
This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next
.
in cs.registers
cs.registers.rsp
), once again, add 2*sizeof(void*)
to it, and you want this function to be stdcall as wellmov rax, arg
(lets just pretend all the functions in your compiler return their arguments in RAX
)cs.caller_registers
jmp cs.caller_registers.rip
This essentially returns from the coroutine_next
call on the coroutine invoker's stack frame, and since the return value is passed in RAX
, we returned arg
. Let's just say if arg
is NULL
, then the coroutine has terminated, otherwise it's an arbitrary data structure.So to recap, you initialize a coroutine using coroutine_init
, then you can repeatedly invoke the instantiated coroutine with coroutine_next
.
The coroutine's function itself is declared:
void my_coro(coroutine_state cs)
cs.initarg
holds the initial function argument (think constructor). Each time my_coro
is called, cs.arg
has a different argument that was specified by coroutine_next
. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield
, and passes one argument to it, which is the return value to the coroutine invoker.
Okay, you may now think "thats easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)
Good learning reference: libcoroutine, especially their setjmp/longjmp implementation. I know its not fun to use an existing library, but you can at least get a general bearing on where you are going.
Simon Tatham has an interesting implementation of coroutines in C that doesn't require any architecture-specific knowledge or stack fiddling. It's not exactly what you're after, but I thought it might nonetheless be of at least academic interest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With