asyncio: running task only if all other tasks are awaiting

I am currently running several endless tasks using asyncio.wait.

I need a special function to run when all the other tasks are awaiting.

import asyncio

async def special_function():
    while True:
        # Does some work, then passes control back to the event loop
        # so the main_tasks can run if they are no longer waiting.
        await asyncio.sleep(0)

async def handler():
    tasks = [task() for task in main_tasks]

    # Add the task that should run when all main_tasks are awaiting:
    tasks.append(special_function())

    await asyncio.wait(tasks)

asyncio.get_event_loop().run_until_complete(handler())

How can I get special_function to run only when all main_tasks are awaiting?


Edit:

What I mean by "all main_tasks are on await": none of the main_tasks is ready to continue, e.g. each is inside asyncio.sleep(100) or is I/O bound and still waiting for data.

While the tasks are in that state and cannot continue, the event loop should run special_function, NOT on every iteration of the event loop.


Edit 2:

My use case:

The main_tasks update a data structure with new data from web-sockets.

The special_function transfers that data to another process when it receives an update signal from that process (multiprocessing with shared variables and data structures).

The data must be as up to date as possible when it is transferred; there must be no pending updates from main_tasks.

That is why I only want to run special_function when no main_tasks have new data available to be processed (i.e. all are waiting on an await).

Zak Stucke asked May 26 '19



2 Answers

I tried to write a check for the 'task not ready to run' condition, but asyncio does not expose scheduler details. The developers have clearly stated that they want to keep the freedom to change asyncio internals without breaking backward compatibility.

In asyncio.Task there is this comment (note: _step() runs the task coroutine until the next await):

# An important invariant maintained while a Task not done:
#   
# - Either _fut_waiter is None, and _step() is scheduled;
# - or _fut_waiter is some Future, and _step() is *not* scheduled.

But that internal variable is not in the API, of course.

You can get some limited access to _fut_waiter by reading the output of repr(task), but the format does not seem reliable either, so I would not depend on something like this:

PENDINGMSG = 'wait_for=<Future pending '

if all(PENDINGMSG in repr(t) for t in monitored_tasks):
    do_something()

Anyway, I think you are trying to be too perfect. You want to know whether there is new data in other tasks. But what if the data is sitting in asyncio buffers? A kernel buffer? The network card's receive buffer? ... You can never know whether new data will arrive in the next millisecond.

My suggestion: write all updates to a single queue. Check that queue as the only source of updates. If the queue is empty, publish the last state.
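A minimal sketch of that queue-based approach (the producer/consumer names and the publish step are assumptions for illustration, not part of the question's code):

```python
import asyncio

async def producer(queue, source):
    # Simulates one web-socket task pushing updates into the shared queue.
    for i in range(3):
        await asyncio.sleep(0)          # stand-in for waiting on network I/O
        await queue.put((source, i))

async def consumer(queue, updates):
    # Drains the queue; the moment it is empty, every producer is awaiting,
    # so that is the safe point to publish the latest state.
    while True:
        update = await queue.get()
        updates.append(update)
        if queue.empty():
            pass  # publish the latest state to the other process here
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    updates = []
    worker = asyncio.ensure_future(consumer(queue, updates))
    await asyncio.gather(*(producer(queue, s) for s in range(2)))
    await queue.join()                  # every queued update has been consumed
    worker.cancel()
    return updates

updates = asyncio.run(main())
print(len(updates))                     # 2 producers x 3 updates each -> 6
```

Because the consumer is the only reader, "queue is empty" is a reliable signal that no update is pending, regardless of what the scheduler is doing internally.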

VPfB answered Sep 17 '22


This is what I'd do:

  1. I'd not use your special function.

  2. Each data update gets its own generation ID (a 4-byte integer), and I'd put only the ID in shared memory.

Both processes are running independently, I assume.

  1. The subscriber keeps a local copy of the generation ID. When it notices that the ID in shared memory has changed, it reads the new data from the file.

  2. Data is stored on tmpfs (/tmp), so it lives in memory. You can mount your own tmpfs if needed. It's fast.

Here is why:

  • To make sure the subscriber doesn't fetch half-written data from shared memory, the data would have to be protected by a semaphore. That's a PITA.
  • By using a file, you can carry variable-size data. (This may not apply to you.) One of the hard things about shared memory is reserving enough space without wasting it; using a file solves that problem.
  • With a 4-byte int generation ID, updating the ID is atomic. This is a huge advantage.

So, as one of your tasks receives new data: open a file, write to it, close the file descriptor, and then write the generation ID to shared memory. Before updating the generation ID, you can safely delete the data file. If the subscriber has already opened the file, it will finish reading it; if it tries to open the deleted file, the open fails and it simply waits for the next generation. If the machine crashes, /tmp is gone, so you don't need to worry about cleaning up files. You could even add a task whose sole job is to delete files from older generations in /tmp.
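A minimal sketch of the generation-ID scheme (all names here are assumptions; it also uses an atomic os.replace() rename instead of the delete-before-update ordering described above, which achieves the same "never see half-written data" guarantee):

```python
import multiprocessing as mp
import os
import tempfile

# Hypothetical file path; on many Linux systems /tmp is tmpfs-backed.
DATA_PATH = os.path.join(tempfile.gettempdir(), "feed.dat")

def publish(generation, payload):
    # Write the payload to a scratch file, then atomically rename it into
    # place: readers see either the old file or the new one, never a mix.
    tmp_path = DATA_PATH + ".new"
    with open(tmp_path, "wb") as f:
        f.write(payload)
    os.replace(tmp_path, DATA_PATH)
    with generation.get_lock():
        generation.value += 1        # bump the shared 4-byte generation ID

def subscribe(generation, last_seen):
    # Only touch the file when the generation ID has moved on.
    current = generation.value
    if current == last_seen:
        return last_seen, None       # nothing new since the last check
    with open(DATA_PATH, "rb") as f:
        return current, f.read()

generation = mp.Value("i", 0)        # shared 4-byte int, starts at 0
publish(generation, b"tick-1")
seen, data = subscribe(generation, 0)
print(seen, data)                    # 1 b'tick-1'
```

The subscriber polls only the tiny shared integer; the (possibly large, variable-size) payload is read from the file only when the ID actually changes.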

Naoyuki Tai answered Sep 19 '22