What is the overhead of an asyncio task? [closed]

What is the overhead of any asyncio task in terms of memory and speed? Is it ever worth minimising the number of tasks in cases when they don’t need to run concurrently?

asked Apr 19 '19 by Michal Charemza

1 Answer

What is the overhead of any asyncio task in terms of memory and speed?

TL;DR The memory overhead appears negligible, but the time overhead can be significant, especially when the awaited coroutine chooses not to suspend.

Let's assume you are measuring the overhead of a task compared to a directly awaited coroutine, e.g.:

await some_coro()                       # (1)
await asyncio.create_task(some_coro())  # (2)

There is no reason to write (2) directly, but creating an unnecessary task can easily arise when using APIs that automatically "futurize" the awaitables they receive, such as asyncio.gather or asyncio.wait_for. (I suspect that building or using such an abstraction is what lies behind this question.)
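
To make that scenario concrete, here is a minimal sketch (the futurize helper is hypothetical, not a real asyncio API) of how such an abstraction quietly turns a bare coroutine into a task, i.e. how variant (1) becomes variant (2) without the caller asking for it:

import asyncio

async def some_coro():
    return 42

def futurize(aw):
    # Hypothetical helper, similar in spirit to what asyncio.gather and
    # asyncio.wait_for do internally: a bare coroutine becomes a Task here.
    return asyncio.ensure_future(aw)

async def main():
    # The caller just wanted to await the coroutine, but the helper
    # silently wrapped it in a task - variant (2) in disguise.
    print(await futurize(some_coro()))

asyncio.run(main())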

It is straightforward to measure the memory and time difference between the two variants. For example, the following program creates a million tasks, and the memory consumption of the process can be divided by a million to get an estimate of the memory cost of a task:

import asyncio
import time

async def noop():
    pass

async def mem1():
    # a million tasks, each wrapping a noop() coroutine
    tasks = [asyncio.create_task(noop()) for _ in range(1000000)]
    time.sleep(60)  # not asyncio.sleep() in this case - we don't
                    # want our noop tasks to exit immediately

On my 64-bit Linux machine running Python 3.7, the process consumes approximately 1 GiB of memory. That's about 1 KiB per task+coroutine, and it counts both the memory for the task and the memory for its entry in the event loop bookkeeping. The following program measures an approximation of the overhead of just a coroutine:

async def mem2():
    coros = [noop() for _ in range(1000000)]
    time.sleep(60)
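
For completeness, here is one way (a sketch; the figures above refer to the process's overall memory consumption, observed externally) to drive these coroutines and read the peak memory from within the process. On Linux, resource.getrusage reports ru_maxrss in KiB:

import asyncio
import resource

asyncio.run(mem1())  # or asyncio.run(mem2())
# Peak resident set size of the process, in KiB on Linux.
print("peak RSS (KiB):", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Note that when mem1() returns, asyncio.run() still cancels the million leftover pending tasks before closing the loop.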

The above process takes about 550 MiB of memory, or roughly 0.55 KiB per coroutine. So while a task isn't exactly free, it doesn't impose a huge memory overhead over a coroutine, especially keeping in mind that the above coroutine was empty. If the coroutine held some state, the task's relative overhead would have been even smaller.

But what about the CPU overhead - how long does it take to create and await a task compared to just awaiting a coroutine? Let's try a simple measurement:

async def cpu1():
    t0 = time.time()
    for _ in range(1000000):
        await asyncio.create_task(noop())
    t1 = time.time()
    print(t1-t0)

On my machine this takes 27 seconds (on average, with very small variations) to run. The version without a task would look like this:

async def cpu2():
    t0 = time.time()
    for _ in range(1000000):
        await noop()
    t1 = time.time()
    print(t1-t0)
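
To reproduce the timings, both benchmarks can be driven the usual way:

asyncio.run(cpu1())
asyncio.run(cpu2())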

This one takes only 0.16 seconds, a factor of ~170! So it turns out that the time overhead of awaiting a task is non-negligible compared to awaiting a coroutine object. This is for two reasons:

  • Tasks are more expensive to create than coroutine objects, because they require initializing the base Future, then the properties of the Task itself, and finally inserting the task into the event loop, with its own bookkeeping.

  • A freshly created task is in a pending state, its constructor having scheduled it to start executing the coroutine at the first opportunity. Since the task owns the coroutine object, awaiting a fresh task cannot just start executing the coroutine; it has to suspend and wait for the task to get around to executing it. The awaiting coroutine will only resume after a full event loop iteration, even when awaiting a coroutine that chooses not to suspend at all! An event loop iteration is expensive because it goes through all runnable tasks and polls the kernel for IO and timeout activities. Indeed, strace of cpu1 shows two million calls to epoll_wait(2). cpu2 on the other hand only goes to the kernel for the occasional allocation-related mmap(), a couple thousand in total.

    In contrast, directly awaiting a coroutine doesn't yield to the event loop unless the awaited coroutine itself decides to suspend. Instead, it immediately goes ahead and starts executing the coroutine as if it were an ordinary function.
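
The scheduling difference is easy to observe directly. The following self-contained sketch (the names are mine, not from the benchmarks above) shows that awaiting a bare coroutine runs it immediately, while awaiting a freshly created task first yields to the event loop, so previously scheduled callbacks get to run before the task's coroutine even starts:

import asyncio

async def probe(label):
    print(label, "coroutine running")

async def main():
    # Direct await: probe() starts executing right away, like a function call.
    await probe("direct")

    # Schedule a callback, then await a fresh task. The callback fires first,
    # because awaiting the task suspends main() until the event loop gets
    # around to running the task's coroutine.
    loop = asyncio.get_running_loop()
    loop.call_soon(lambda: print("callback runs before the task's coroutine"))
    await asyncio.create_task(probe("task"))

asyncio.run(main())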

So, if your coroutine's happy path does not involve suspending (as is the case with non-contended synchronization primitives, or with stream reads from a non-blocking socket that already has data to provide), the cost of awaiting it is comparable to the cost of a function call. That is much faster than the event loop iteration required to await a task, and it can make a difference when latency matters.
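
For example, here is a rough sketch (my own variation on the benchmark above, not from the original answer) that compares acquiring an uncontended asyncio.Lock directly with acquiring it through a task; on CPython the uncontended acquire() returns without suspending, so the direct version stays close to plain function call cost:

import asyncio
import time

N = 100000

async def lock_direct():
    lock = asyncio.Lock()
    t0 = time.time()
    for _ in range(N):
        await lock.acquire()   # uncontended: fast path, no suspension
        lock.release()
    print("direct:", time.time() - t0)

async def lock_via_task():
    lock = asyncio.Lock()
    t0 = time.time()
    for _ in range(N):
        # wrapping the acquire in a task forces a full event loop iteration
        await asyncio.create_task(lock.acquire())
        lock.release()
    print("via task:", time.time() - t0)

asyncio.run(lock_direct())
asyncio.run(lock_via_task())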

answered Oct 22 '22 by user4815162342