Fastest way to loop over Pandas DataFrame for API calls

Question

My objective is to make a call to an API for each row in a Pandas DataFrame, which contains a List of strings in the response JSON, and creating a new DataFrame with one row per response. My code basically looks like this:

i = 0
new_df = pandas.DataFrame(columns = ['a','b','c','d'])
for index,row in df.iterrows():
    url = 'http://myAPI/'
    d = '{"SomeJSONData:"' + row['data'] + '}'
    j = json.loads(d)
    response = requests.post(url,json = j)

    data = response.json()
    for new_data in data['c']:
        new_df.loc[i] = [row['a'],row['b'],row['c'],new_data]
        i += 1

This works fine, but I'm making about 5500 API calls and writing about 6500 rows to the new DataFrame so it takes a while, maybe 10 minutes. I was wondering if anyone knew of a way to speed this up? I'm not too familiar with running parallel for loops in Python, could this be done while maintaining thread safety?

2yan · Accepted Answer

Something along these lines perhaps? This way you aren't creating a whole new dataframe, you're only declaring URL once, and you're taking advantage of the fact that pandas column operations are faster than row by row stuff.

url = 'http://myAPI/'

def request_function(j):
    return requests.post(url,json = json.loads(j))['c'] 

df['j']= '{"SomeJsonData:"' + df['data'] + '}'
df['new_data'] = df['j'].apply(request_function)

Now to prove that using apply in this case ( String data ) is indeed much faster, here's a simple test:

import numpy as np
import pandas as pd
import time

def func(text):
    return text + ' is processed'


def test_one():
    data =pd.DataFrame(columns = ['text'], index = np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()
    data['text'] = data['text'].apply(func)
    print(time.time() - start)


def test_two():
    data =pd.DataFrame(columns = ['text'], index = np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()

    for index, row in data.iterrows():
        data.loc[index, 'text'] = row['text'] + ' is processed'

    print(time.time() - start)

Results of string operations on dataframes.

test_one(using apply) : 0.023002147674560547

test_two(using iterrows): 18.912891149520874

Basically, by using the built-in pandas operations of adding the two columns and apply, you should have somewhat faster results, your response time is indeed limited by the API response time. If the results are still too slow, you might what to consider writing an async function that saves the results to a list. Then you send.apply that async function.

Fastest way to loop over Pandas DataFrame for API calls

Tags:

python

pandas

python-requests

Martin Boros

1 Answers

2yan

Recent Activity

Donate For Us

Fastest way to loop over Pandas DataFrame for API calls

Tags:

python

pandas

python-requests

Martin Boros

1 Answers

2yan

Related questions

Recent Activity

Donate For Us