Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are stackless C++20 coroutines a problem?

Based on the following it looks like coroutines in C++20 will be stackless.

https://en.cppreference.com/w/cpp/language/coroutines

I'm concerned for many reasons:

  1. On embedded systems heap allocation is often not acceptable.
  2. When in low level code, nesting of co_await would be useful (I don't believe stackless co-routines allow this).

With a stackless coroutine, only the top-level routine may be suspended. Any routine called by that top-level routine may not itself suspend. This prohibits providing suspend/resume operations in routines within a general-purpose library.

https://www.boost.org/doc/libs/1_57_0/libs/coroutine/doc/html/coroutine/intro.html#coroutine.intro.stackfulness

  1. More verbose code because of the need for custom allocators and memory pooling.

  2. Slower if the task waits for the operating system to allocate it some memory (without memory pooling).

Given these reasons, I'm really hoping I'm very wrong about what the current coroutines are.

The question has three parts:

  1. Why would C++ choose to use stackless coroutines?
  2. Regarding allocations to save state in stackless coroutines. Can I use alloca() to avoid any heap allocations that would normally be used for the coroutine creation.

coroutine state is allocated on the heap via non-array operator new. https://en.cppreference.com/w/cpp/language/coroutines

  1. Are my assumptions about the c++ coroutines wrong, why?

EDIT:

I'm going through the cppcon talks for the coroutines now, if I find any answers to my own question I'll post it (nothing so far).

CppCon 2014: Gor Nishanov "await 2.0: Stackless Resumable Functions"

https://www.youtube.com/watch?v=KUhSjfSbINE

CppCon 2016: James McNellis “Introduction to C++ Coroutines"

https://www.youtube.com/watch?v=ZTqHjjm86Bw

like image 678
David Ledger Avatar asked Jul 23 '19 11:07

David Ledger


People also ask

What is a Stackless coroutine?

A coroutine is a function that can suspend execution to be resumed later. Coroutines are stackless: they suspend execution by returning to the caller and the data that is required to resume execution is stored separately from the stack.

Are coroutines asynchronous C++?

C++ coroutines can also be used for asynchronous programming by having a coroutine represent an asynchronous computation or an asynchronous task.

Does C++ have coroutines?

The main idea of coroutines is to have a function that preserves a state while releasing control back to the caller. Coroutines in C++ are a complex beast. The coroutine implementer should manage the frame to be created when yielding out, but we used an external library that manages this for us.

What are coroutines in programming?

Coroutines are computer program components that generalize subroutines for non-preemptive multitasking, by allowing execution to be suspended and resumed. Coroutines are well-suited for implementing familiar program components such as cooperative tasks, exceptions, event loops, iterators, infinite lists and pipes.


1 Answers

I use stackless coroutines on small, hard-realtime ARM Cortex-M0 targets, with 32kb of RAM, where there's no heap allocator present at all: all memory is statically preallocated. The stackless coroutines are a make-or-break, and stackful coroutines that I had previously used were a pain to get right, and were essentially a hack wholly based on implementation-specific behavior. Going from that mess to standards-compliant, portable C++, was wonderful. I shudder to think that someone might suggest going back.

  • Stackless coroutines don't imply heap use: you have full control over how the coroutine frame is allocated (via void * operator new(size_t) member in promise type).

  • co_await can be nested just fine, in fact it's a common use case.

  • Stackful coroutines have to allocate those stacks somewhere as well, and it's perhaps ironic that they can't use the thread's primary stack for that. These stacks are allocated on the heap, perhaps via a pool allocator that gets a block from heap and then subdivides it.

  • Stackless coroutine implementations can elide frame allocation, such that the promise's operator new is not called at all, whereas stackful coroutines always allocate the stack for the coroutine, whether needed or not, because the compiler can't help the coroutine runtime with eliding it (at least not in C/C++).

  • The allocations can be elided precisely by using the stack where the compiler can prove that the life of the coroutine doesn't leave the scope of the caller. And that's the only way you can use alloca. So, the compiler already takes care of it for you. How cool is that!

    Now, there's no requirement that the compilers actually do this elision, but AFAIK all implementations out there do this, with some sane limits on how complex that "proof" can be - in some cases it's not a decidable problem (IIRC). Plus, it's easy to check whether the compiler did as you expected: if you know that all coroutines with a particular promise type are nested-only (reasonable in small embedded projects but not only!), you can declare operator new in the promise type but not define it, and then the code won't link if the compiler "goofed up".

    A pragma could be added to a particular compiler implementation to declare that a particular coroutine frame doesn't escape even if the compiler isn't clever enough to prove it - I didn't check if anyone bothered to write these yet, because my use cases are reasonable enough that the compiler always does the right thing.

    Memory allocated with alloca cannot be used after you return from the caller. The use case for alloca, in practice, is to be a slightly more portable way of expressing gcc's variable-size automatic array extension.

In essentially all implementations of stackful coroutines in C-like lanaguages, the one and only supposed "benefit" of stackfull-ness is that the frame is accessed using the usual base-pointer-relative addressing, and push and pop where appropriate, so "plain" C code can run on this made-up stack, with no changes to code generator. No benchmarks support this mode of thinking, though, if you have lots of coroutines active - it's a fine strategy if there's a limited number of them, and you have the memory to waste to start with.

Stack has to be overallocated, decreasing locality of reference: a typical stackful coroutine uses a full page for the stack at the minimum, and the cost of making this page available is not shared with anything else: the single coroutine has to bear it all. That's why it was worthwhile to develop stackless python for multiplayer game servers.

If there's a couple of couroutines only - no problem. If you've got thousands of network requests all handled by stackful coroutines, with a light networking stack that doesn't impose overhead that monopolizes the performance, the performance counters for cache misses will make you cry. As Nicol has stated in the other answer, this becomes somewhat less relevant the more layers there are between the coroutine and whatever asynchronous operation it's handling.

It has been long since any 32+-bit CPU had performance benefits inherent to memory access via any particular addressing mode. What matters is cache-friendly access patterns and leveraging prefetch, branch prediction and speculative execution. Paged memory and its backing store are just two further levels of cache (L4 and L5 on desktop CPUs).

  1. Why would C++ choose to use stackless coroutines? Because they perform better, and no worse. On the performance side, there can be only benefits to them. So it's a no-brainer, performance-wise, to just use them.

  2. Can I use alloca() to avoid any heap allocations that would normally be used for the coroutine creation. No. It'd be a solution to a nonexistent problem. Stackful coroutines don't actually allocate on the existing stack: they create new stacks, and those are allocated on the heap by default, just as C++ coroutine frames would be (by default).

  3. Are my assumptions about the c++ coroutines wrong, why? See above.

  4. More verbose code because of the need for custom allocators and memory pooling. If you want stackful coroutines to perform well, you'll be doing the same thing to manage the memory areas for the stacks, and it turns out that it's even harder. You need to minimize memory waste, and thus you need to minimally overallocate the stack for the 99.9% use case, and deal somehow with coroutines that exhaust this stack.

    One way I have dealt with it in C++ was by doing stack checks in branch points where code analysis indicates more stack may be needed, then if the stack would overflow, an exception was thrown, the coroutine's work undone (the design of the system had to support it!), and then the work restarted with more stack. It's an easy way to quickly lose benefits of tightly packed stack-fuls. Oh, and I had to provide my own __cxa_allocate_exception for that to work. Fun, eh?

One more anecdote: I'm playing with using coroutines inside Windows kernel-mode drivers, and there the stacklessness does matter - to the extent that if the hardware allows, you can allocate the packet buffer and the coroutine's frame together, and these pages are pinned when they are submitted to the network hardware for execution. When the interrupt handler resumes the coroutine, the page is there, and if the network card allows, it could even prefetch it for you so it'll be in the cache. So that works well - it's just one use case, but since you wanted embedded - I've got embedded :).

It's perhaps not common to think of drivers on desktop platforms as "embedded" code, but I see lots of similarities, and an embedded mindset is needed. Last thing you want is kernel code that allocates too much, especially if it would add per-thread overhead. A typical desktop PC has a few thousand threads present, and a lot of them are there to handle I/O. Now imagine a diskless system that uses iSCSI storage. On such a system, anything I/O bound that's not bound to USB or GPU will be bound to the network hardware and the networking stack.

Finally: Trust benchmarks, not me, and read Nicol's answer too!. My perspective is shaped by my use cases - I can generalize, but I claim no first-hand experience with coroutines in "generalist" code where performance is of less concern. Heap allocations for stackless coroutines are very often hardly noticeable in performance traces. In general-purpose application code, it's rarely going to be a problem. It does get "interesting" in library code, and some patterns have to be developed to allow the library user to customize this behavior. These patterns will be found and popularized as more libraries use C++ coroutines.

like image 78
Kuba hasn't forgotten Monica Avatar answered Oct 15 '22 19:10

Kuba hasn't forgotten Monica