I have a large dataset that I have to convert to .csv format; it consists of 29 columns and 1M+ rows. I noticed that as the DataFrame grows, appending rows to it gets more and more time consuming. I wonder if there is a faster way to do this; I'm sharing the relevant snippet from the code.
Any recommendations are welcome.
import time

import requests
from pandas import DataFrame, concat, json_normalize

df = DataFrame()
for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # datatype = list, contains = dict
    normalized = json_normalize(jsonList)  # type(normalized) = pandas.DataFrame
    print(startID / 1000)  # status indicator
    for _, series in normalized.iterrows():  # iterrows yields (index, Series) tuples
        offers = series['offers']
        series = series.drop('offers')  # Series.drop takes labels, not columns
        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts the 1-row DataFrame into a Series
            concatenated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatenated, ignore_index=True)
    del normalized
    df = df.append(tempdf)
    f1 = time.time()
    print(f1 - s1, ' seconds')
df.to_csv('out.csv')
The concat function is 50 times faster than the DataFrame.append version. With repeated append calls, a new DataFrame is created at each iteration, and the underlying data is copied each time.
Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame, at its final size, in one go, as sketched below.
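For example, the inner loop from the question can collect plain dicts and build the frame once at the end. A minimal sketch of the idea, reusing the normalized frame from the question's code and assuming each offer is a flat dict (deeply nested offers would still need json_normalize):

import pandas as pd

rows = []  # appending to a plain Python list is cheap; no DataFrame copies
for _, product in normalized.iterrows():
    base = product.drop('offers').to_dict()  # product fields shared by every offer
    for offer in product['offers']:
        rows.append({**base, **offer})       # one flat dict per output row

df = pd.DataFrame(rows)  # build the frame once, at its final size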
On a task joining two datasets, Polars finished in 43 seconds while Pandas took 628 seconds, so Polars was almost 15 times faster.
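For context, the join itself looks nearly identical in both libraries. A minimal sketch with made-up toy data (the column names id, price, and stock are illustrative, not taken from the benchmark above):

import pandas as pd
import polars as pl

left = {'id': [1, 2, 3], 'price': [9.5, 7.0, 3.2]}
right = {'id': [2, 3, 4], 'stock': [10, 0, 5]}

# pandas: inner join on the shared key
pd_result = pd.DataFrame(left).merge(pd.DataFrame(right), on='id')

# Polars: the same inner join
pl_result = pl.DataFrame(left).join(pl.DataFrame(right), on='id')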
Pandas DataFrame append() Method: the append() method appends a DataFrame-like object at the end of the current DataFrame. It returns a new DataFrame object; no changes are made to the original DataFrame.
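A minimal illustration of that return behavior (note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement):

import pandas as pd

df = pd.DataFrame({'a': [1]})
row = pd.DataFrame({'a': [2]})

df2 = df.append(row, ignore_index=True)  # returns a new DataFrame (pandas < 2.0)
print(len(df), len(df2))                 # 1 2, so df itself is unchanged

df3 = pd.concat([df, row], ignore_index=True)  # the modern equivalent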
As Mohit Motwani suggested, the fastest way is to collect the data into a list of dictionaries and then load it all into a DataFrame at once. Below are some speed measurement examples:
import pandas as pd
import numpy as np
import time
import random
end_value = 10000
Measurement for creating a list of dictionaries and, at the end, loading it all into a DataFrame:
start_time = time.time()

dictionary_list = []
for i in range(end_value):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)

df_final = pd.DataFrame.from_dict(dictionary_list)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 0.090153 seconds
Measurements for appending data into a list and concatenating it into a DataFrame at the end:
start_time = time.time()

appended_data = []
for i in range(end_value):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A' * 30))  # 30 columns, all named 'A'
    appended_data.append(data)
appended_data = pd.concat(appended_data, axis=0)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 4.183921 seconds
Measurements for appending DataFrames one at a time:
start_time = time.time()

df_final = pd.DataFrame()
for i in range(end_value):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A' * 30))
    df_final = df_final.append(df)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 11.085888 seconds
Measurements for inserting rows with loc:
start_time = time.time()

df = pd.DataFrame(columns=list('A' * 30))
for i in range(end_value):
    df.loc[i] = list(np.random.randint(0, 100, size=30))

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 21.029176 seconds