I have a pandas DataFrame and I would like to execute a function on each row. However, the function makes an I/O call to a remote server, so it is very slow if I simply run it with .apply() on the DataFrame.
Here is an example:
def f(data):
    r = requests.get(data["url"])
    x = process(r.content)
    y = process_2(r.content)
    z = process_3(r.content)
    print("Done")
    return [x, y, z]

df.apply(lambda x: f(x), axis=1)
In this code, the problem is that requests.get(data["url"]) takes a while, so the entire apply() call is very slow to finish; the print() output appears on the console only every few seconds.
Is it possible to execute the apply() function asynchronously and get the result faster? My DataFrame has 5,000+ rows, and the function call for each row takes a few seconds.
The apply() function is used to apply a function along an axis of the DataFrame. Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
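A minimal illustration of the two axes (the column names a and b are made up for the example):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# axis=1: each row is passed to the function as a Series
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)  # 4, 6

# axis=0 (the default): each column is passed as a Series
col_sums = df.apply(lambda col: col.sum())                    # a: 3, b: 7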
First, pandas is single-threaded, meaning a plain .apply() runs on one core and cannot leverage multiple cores on a machine or cluster.
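Because the work here is I/O-bound, even plain threads get around this limitation: Python releases the GIL while waiting on the network. A minimal sketch using concurrent.futures, reusing the question's f() (process, process_2, and process_3 are assumed to be defined as in the question; max_workers=16 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

import requests

def f(data):
    # The GIL is released during the network wait, so threads overlap
    # the slow requests even though pandas itself is single-threaded
    r = requests.get(data["url"])
    return [process(r.content), process_2(r.content), process_3(r.content)]

with ThreadPoolExecutor(max_workers=16) as pool:
    # Iterate rows as Series, mirroring df.apply(..., axis=1)
    results = list(pool.map(f, (row for _, row in df.iterrows())))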
Asynchronous I/O approach with the well-known asyncio + aiohttp libraries, demonstrated on a sample DataFrame and simple webpage-content processing routines (to show the mechanics of the approach).
Let's say we need to count all header, link (<a>) and span tags across all URLs and store the resulting counters in the source DataFrame.
import pandas as pd
import asyncio
import aiohttp
from bs4 import BeautifulSoup

def count_headers(html):
    return len(html.select('h1,h2,h3,h4,h5,h6'))

def count_links(html):
    return len(html.find_all('a'))

def count_spans(html):
    return len(html.find_all('span'))

df = pd.DataFrame({'id': [1, 2, 3],
                   'url': ['https://stackoverflow.com/questions',
                           'https://facebook.com',
                           'https://wiki.archlinux.org']})
df['head_c'], df['link_c'], df['span_c'] = [None, None, None]
# print(df)

async def process_url(df, url):
    async with aiohttp.ClientSession() as session:
        resp = await session.get(url)
        content = await resp.text()
    soup = BeautifulSoup(content, 'html.parser')
    headers_count = count_headers(soup)
    links_count = count_links(soup)
    spans_count = count_spans(soup)
    print("Done")
    # Write the counters back into the row that matches this url
    df.loc[df['url'] == url, ['head_c', 'link_c', 'span_c']] = \
        [[headers_count, links_count, spans_count]]

async def main(df):
    # One coroutine per url; they run concurrently, so the total time
    # is roughly that of the slowest request, not the sum of all of them
    await asyncio.gather(*[process_url(df, url) for url in df['url']])
    print(df)

asyncio.run(main(df))
The output:
Done
Done
Done
id url head_c link_c span_c
0 1 https://stackoverflow.com/questions 25 306 0
1 2 https://facebook.com 3 55 0
2 3 https://wiki.archlinux.org 15 91 0
Enjoy the performance difference.
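One caveat for the 5,000-row case: gathering one task per row opens thousands of simultaneous connections. A minimal sketch of capping concurrency with asyncio.Semaphore (the bounded_fetch helper and limit=50 are made up for illustration):

import asyncio
import aiohttp

# Hypothetical helper: the semaphore caps how many requests are in
# flight at once
async def bounded_fetch(sem, session, url):
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def fetch_all(urls, limit=50):
    sem = asyncio.Semaphore(limit)
    # One shared session for all requests (connection pooling)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *[bounded_fetch(sem, session, url) for url in urls])

# texts = asyncio.run(fetch_all(df['url'].tolist()))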
Another idea for an optimized apply() for pandas is to use Swifter: https://github.com/jmcarpenter2/swifter
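Usage is a drop-in replacement for .apply(); a minimal sketch, assuming the question's f() is defined:

import pandas as pd
import swifter  # importing registers the .swifter accessor on DataFrames

# Swifter decides whether to vectorize or parallelize across cores
result = df.swifter.apply(lambda row: f(row), axis=1)

Swifter parallelizes across cores, which can also overlap some I/O waits, though the asyncio approach above scales to far more concurrent requests.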