 

Asyncio Pandas with Inplace

I just read this introduction, but am having trouble implementing either of the examples (commented code being the second example):

import asyncio
import pandas as pd
from openpyxl import load_workbook

async def loop_dfs(dfs):
    async def clean_df(df):
        df.drop(["column_1"], axis=1, inplace=True)
        ... a bunch of other inplace=True functions ...
        return "Done"

    # tasks = [clean_df(df) for (table, dfs) in dfs.items()]
    # await asyncio.gather(*tasks)

    tasks = [clean_df(df) for (table, df) in dfs.items()]
    completed, pending = await asyncio.wait(tasks)


def main():
    dfs = {
        sn: pd.read_excel("excel.xlsx", sheet_name=sn)
        for sn in load_workbook("excel.xlsx").sheetnames
    }

    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(loop_dfs(dfs))

    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(loop_dfs(dfs))
    finally:
        loop.close()

main()

I saw a few other posts about how pandas doesn't support asyncio, and maybe I'm just missing the bigger picture, but that shouldn't matter if I'm doing in-place operations, right? I saw recommendations for Dask, but without immediate support for reading Excel I figured I'd try this first. However, I keep getting

RuntimeError: Event loop already running

asked Sep 14 '18 by Tony


1 Answer

I saw a few other posts about how pandas doesn't support asyncio, and maybe I'm just missing the bigger picture, but that shouldn't matter if I'm doing in-place operations, right?

In-place operations are those that modify existing data. That is a matter of efficiency, whereas your goal appears to be parallelization, an entirely different matter.
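
To make the distinction concrete, here is a minimal sketch (the column names are just placeholders mirroring your example):

import pandas as pd

df = pd.DataFrame({"column_1": [1, 2], "column_2": [3, 4]})

# Without inplace: drop() returns a new DataFrame and df is untouched.
trimmed = df.drop(["column_1"], axis=1)

# With inplace: drop() mutates df and returns None. This saves a copy,
# but the work still runs synchronously on the calling thread.
df.drop(["column_1"], axis=1, inplace=True)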

Pandas doesn't support asyncio not just because no one has implemented it yet, but because Pandas rarely does the kind of work asyncio is good at: network and subprocess IO. Pandas functions either use the CPU or wait for disk access, neither of which is a good fit for asyncio.

Asyncio allows network communication to be expressed with coroutines that look like ordinary synchronous code. Inside a coroutine, every blocking operation (e.g. a network read) is awaited, which automatically suspends the whole task if the data is not yet available. At each such suspension the event loop switches to the next runnable task, effectively creating a cooperative multi-tasking system.
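
To illustrate the kind of workload asyncio is built for, here is a minimal sketch (not from the question) in which await asyncio.sleep stands in for waiting on the network; each coroutine suspends at its await, so both make progress during each other's waits:

import asyncio

async def fetch(name, delay):
    print(f"{name}: waiting on the 'network'")
    await asyncio.sleep(delay)  # suspension point: the event loop runs other tasks here
    return name

async def main():
    # Runs concurrently: total time is ~2 seconds, not ~3.
    results = await asyncio.gather(fetch("a", 1), fetch("b", 2))
    print(results)

asyncio.run(main())  # Python 3.7+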

When trying to call a library that doesn't support asyncio, such as pandas, things will superficially appear to work, but you won't get any benefit and the code will run serially. For example:

async def loop_dfs(dfs):
    async def clean_df(df):
        ...    
    tasks = [clean_df(df) for (table, df) in dfs.items()]
    completed, pending = await asyncio.wait(tasks)

Since clean_df doesn't contain a single instance of await, it is a coroutine in name only - it will never actually suspend its execution to allow other coroutines to run. Thus await asyncio.wait(tasks) will run the tasks in series, as if you wrote:

for table, df in dfs.items():
    clean_df(df)
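
You can verify this by timing it; a small sketch (not from the original code) in which three "coroutines" that never await take the full three seconds instead of one:

import asyncio
import time

async def fake_clean(n):
    time.sleep(1)  # blocking call: never yields control to the event loop
    return n

async def main():
    start = time.monotonic()
    await asyncio.gather(*(fake_clean(i) for i in range(3)))
    print(f"took {time.monotonic() - start:.1f}s")  # ~3.0s, i.e. strictly serial

asyncio.run(main())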

To get things to run in parallel (provided pandas occasionally releases the GIL during its operations), you should hand off the individual CPU-bound functions to a thread pool:

async def loop_dfs(dfs):
    def clean_df(df):  # note: ordinary def
        ...
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(None, clean_df, df)  # None = default thread pool executor
             for (table, df) in dfs.items()]
    completed, pending = await asyncio.wait(tasks)
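
Driving that coroutine from main() works exactly as in your script; on Python 3.7+ a sketch could simply use asyncio.run instead of managing the loop by hand:

def main():
    dfs = {
        sn: pd.read_excel("excel.xlsx", sheet_name=sn)
        for sn in load_workbook("excel.xlsx").sheetnames
    }
    asyncio.run(loop_dfs(dfs))  # creates, runs, and closes the event loop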

If you go down that route, you don't need asyncio in the first place; you can use concurrent.futures directly. For example:

import concurrent.futures

def loop_dfs(dfs):  # note: ordinary def
    def clean_df(df):
        ...
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(clean_df, df)
                   for (table, df) in dfs.items()]
        concurrent.futures.wait(futures)
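
If it turns out that the Pandas calls you need hold the GIL, a process pool is the usual fallback. Here is a sketch of that variant (note that each DataFrame is pickled to a worker process, so inplace=True no longer helps and the cleaned frame has to be returned):

import concurrent.futures

def clean_df(df):  # must live at module level so it can be pickled
    return df.drop(["column_1"], axis=1)

def loop_dfs(dfs):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = {table: executor.submit(clean_df, df)
                   for (table, df) in dfs.items()}
        return {table: fut.result() for (table, fut) in futures.items()}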

figured I'd try this first but I keep getting RuntimeError: Event loop already running

That error typically means that you've started the script in an environment that already runs asyncio for you, such as a Jupyter notebook. If that is the case, make sure you run your script with stock python, or consult your notebook's documentation on how to submit the coroutines to the event loop that is already running.
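
For example, inside a Jupyter notebook (which already runs an event loop) you would not call run_until_complete at all; recent IPython versions let a cell await the coroutine directly. A sketch:

# in a notebook cell -- no get_event_loop() / run_until_complete()
dfs = {
    sn: pd.read_excel("excel.xlsx", sheet_name=sn)
    for sn in load_workbook("excel.xlsx").sheetnames
}
await loop_dfs(dfs)  # submitted to the loop the notebook is already running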

answered Sep 29 '22 by user4815162342