Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest and most efficient way to append rows to a DataFrame?

I have a large dataset which I have to convert to .csv format, it consists of 29 columns and 1M+ lines. I figured that as the dataframe gets larger, appending any rows to it is getting more and more time consuming. I wonder if there is any faster way to this, sharing the relevant snippet from the code.

Any recommendations are welcome though.


df = DataFrame()

for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'

    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # datatype= list, contains= dict

    normalized = json_normalize(jsonList)
    # type(normal) = pandas.DataFrame
    print(startID / 1000) # status indicator
    for series in normalized.iterrows():  
        series = series[1] # iterrows returns tuple (index, series)
        offers = series['offers']
        series = series.drop(columns='offers')
        length = len(offers)

        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts DataFrame into Series
            concatinated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatinated, ignore_index=True)

    del normalized
    df = df.append(tempdf)
    f1 = time.time()
    print(f1 - s1, ' seconds')

df.to_csv('out.csv')
like image 538
Erdal Dogan Avatar asked Jul 12 '19 05:07

Erdal Dogan


People also ask

Which is faster concat or append?

concat function is 50 times faster than using the DataFrame. append version. With multiple append , a new DataFrame is created at each iteration, and the underlying data is copied each time.

Is appending to a DataFrame slow?

Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame with its final size in one go.

Which is faster than pandas?

On joining two datasets task, Polars has done it in 43 seconds. Meanwhile, Pandas did it in 628 seconds. We can see that Polars is almost 15 times faster than Pandas.

Which method will be used to append DataFrame?

Pandas DataFrame append() Method The append() method appends a DataFrame-like object at the end of the current DataFrame. The append() method returns a new DataFrame object, no changes are done with the original DataFrame.


1 Answers

As Mohit Motwani suggested fastest way is to collect data into dictionary then load all into data frame. Below some speed measurements examples:

import pandas as pd
import numpy as np
import time
import random

end_value = 10000

Measurement for creating a list of dictionaries and at the end load all into data frame

start_time = time.time()
dictinary_list = []
for i in range(0, end_value, 1):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)

df_final = pd.DataFrame.from_dict(dictionary_list)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 0.090153 seconds

Measurements for appending data into list and concat into data frame:

start_time = time.time()
appended_data = []
for i in range(0, end_value, 1):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    appended_data.append(data)

appended_data = pd.concat(appended_data, axis=0)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 4.183921 seconds

Measurements for appending data frames:

start_time = time.time()
df_final = pd.DataFrame()
for i in range(0, end_value, 1):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    df_final = df_final.append(df)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 11.085888 seconds

Measurements for insert data by usage of loc:

start_time = time.time()
df = pd.DataFrame(columns=list('A'*30))
for i in range(0, end_value, 1):
    df.loc[i] = list(np.random.randint(0, 100, size=30))


end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 21.029176 seconds

like image 58
Zaraki Kenpachi Avatar answered Oct 11 '22 10:10

Zaraki Kenpachi