Can I execute a function in "apply" to pandas dataframe asynchronously?

I have a pandas dataframe, and on each row I would like to execute a function. However, the function makes an I/O call to a remote server, so it is very slow when I call it simply via .apply() on the dataframe.

Here is an example:

import requests

def f(data):
    r = requests.get(data["url"])
    x = process(r.content)
    y = process_2(r.content)
    z = process_3(r.content)
    print("Done")

    return [x, y, z]

df.apply(f, axis=1)

In this code, the problem is that requests.get(data["url"]) takes a while, so the entire apply() call is very slow to finish. The print() output appears on the console at intervals of a few seconds.

Is it possible to execute the apply() function asynchronously and get the result faster? My dataframe has 5,000+ rows, and the function call on each row takes a few seconds.
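Since the work is I/O-bound, one low-effort option (short of full async) is to run the row function in a thread pool and write the results back as columns. Here is a minimal sketch; the slow requests.get call is replaced by a hypothetical time.sleep stand-in so the mechanics are self-contained, and you would substitute the real f from the question:

```python
import time
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the slow requests.get + processing from the
# question; swap in the real f(row) here.
def f(row):
    time.sleep(0.1)  # simulates the network round-trip
    return [row["url"].upper(), len(row["url"]), "ok"]

df = pd.DataFrame({"url": ["http://a", "http://b", "http://c"]})

with ThreadPoolExecutor(max_workers=8) as pool:
    # executor.map preserves input order, so results line up with df's rows
    results = list(pool.map(f, (row for _, row in df.iterrows())))

# expand each returned [x, y, z] list into three columns
df[["x", "y", "z"]] = pd.DataFrame(results, index=df.index)
print(df)
```

Because the threads spend almost all their time waiting on I/O, the GIL is not a bottleneck here, and this keeps the per-row function synchronous (no rewrite to async/await needed).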

asked Jul 07 '19 by Blaszard



2 Answers

Asynchronous I/O approach with the well-known asyncio + aiohttp libraries:

Demonstrated on a sample DataFrame with simple webpage-content processing routines (to show the mechanics of the approach).
Let's say we need to count all header, link (<a>), and span tags across all urls and store the resulting counts in the source dataframe.

import pandas as pd
import asyncio
import aiohttp
from bs4 import BeautifulSoup


def count_headers(html):
    return len(html.select('h1,h2,h3,h4,h5,h6'))

def count_links(html):
    return len(html.find_all('a'))

def count_spans(html):
    # note: the tag name is 'span', not 'spans'
    return len(html.find_all('span'))


df = pd.DataFrame({'id': [1, 2, 3], 'url': ['https://stackoverflow.com/questions',
                                            'https://facebook.com',
                                            'https://wiki.archlinux.org']})
df['head_c'], df['link_c'], df['span_c'] = [None, None, None]
# print(df)

async def process_url(session, df, url):
    resp = await session.get(url)
    content = await resp.text()
    soup = BeautifulSoup(content, 'html.parser')
    headers_count = count_headers(soup)
    links_count = count_links(soup)
    spans_count = count_spans(soup)
    print("Done")

    df.loc[df['url'] == url, ['head_c', 'link_c', 'span_c']] = \
        [[headers_count, links_count, spans_count]]


async def main(df):
    # share one ClientSession (and its connection pool) across all requests
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[process_url(session, df, url) for url in df['url']])
    print(df)


# asyncio.run() (Python 3.7+) replaces manual event-loop management
asyncio.run(main(df))

The output (the exact tag counts will vary as the pages change):

Done
Done
Done
   id                                  url  head_c  link_c  span_c
0   1  https://stackoverflow.com/questions      25     306       0
1   2                 https://facebook.com       3      55       0
2   3           https://wiki.archlinux.org      15      91       0

Enjoy the performance difference.
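One caveat at the asker's scale: with 5,000+ rows, asyncio.gather over every URL at once fires thousands of simultaneous requests, which can exhaust sockets or trip server rate limits. A common refinement is to bound concurrency with asyncio.Semaphore. A minimal sketch, with a hypothetical fetch() coroutine standing in for the aiohttp call so the example is self-contained:

```python
import asyncio

# Hypothetical stand-in for the real aiohttp request; replace with
# session.get(url) + resp.text() in practice.
async def fetch(url):
    await asyncio.sleep(0.01)  # simulates the HTTP round-trip
    return f"content-of-{url}"

async def process_url(sem, url):
    async with sem:  # at most `limit` coroutines run the body at once
        return await fetch(url)

async def main(urls, limit=100):
    sem = asyncio.Semaphore(limit)
    # gather preserves input order, so results align with the url list
    return await asyncio.gather(*(process_url(sem, u) for u in urls))

results = asyncio.run(main([f"u{i}" for i in range(250)]))
print(len(results))
```

The semaphore costs almost nothing when the server is the bottleneck, and `limit` gives you one knob to tune throughput against politeness.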

answered Oct 21 '22 by RomanPerekhrest

Another idea for an optimized apply in Pandas is to use Swifter: https://github.com/jmcarpenter2/swifter. Note that Swifter parallelizes across CPU cores, so it helps most with CPU-bound functions; for I/O-bound work like HTTP requests, the async or threaded approaches above are usually a better fit.

answered Oct 21 '22 by msarafzadeh