I have a large dataset that I have to convert to .csv format; it consists of 29 columns and 1M+ rows. I noticed that as the DataFrame grows, appending rows to it gets more and more time consuming. I wonder if there is a faster way to do this; I'm sharing the relevant snippet from the code.
Any recommendations are welcome.
import time

import requests
from pandas import DataFrame, concat, json_normalize

df = DataFrame()
for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # datatype = list, contains = dict
    normalized = json_normalize(jsonList)  # type(normalized) = pandas.DataFrame
    print(startID / 1000)  # status indicator
    for _, series in normalized.iterrows():  # iterrows yields (index, Series) tuples
        offers = series['offers']
        series = series.drop('offers')  # Series.drop takes labels, not columns
        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts the 1-row DataFrame into a Series
            concatenated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatenated, ignore_index=True)
    del normalized
    df = df.append(tempdf)
    f1 = time.time()
    print(f1 - s1, ' seconds')
df.to_csv('out.csv')
The concat function is 50 times faster than the DataFrame.append version. With repeated append calls, a new DataFrame is created at each iteration, and the underlying data is copied each time.
Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame, at its final size, in one go, as sketched below.
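For example, the inner loop from the question can collect plain dicts and build the frame once at the end. A minimal sketch of the idea, reusing the normalized frame from the question's code and assuming each offer is a flat dict (deeply nested offers would still need json_normalize):

import pandas as pd

rows = []  # appending to a plain Python list is cheap; no DataFrame copies
for _, product in normalized.iterrows():
    base = product.drop('offers').to_dict()  # product fields shared by every offer
    for offer in product['offers']:
        rows.append({**base, **offer})       # one flat dict per output row

df = pd.DataFrame(rows)  # build the frame once, at its final size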
On a task joining two datasets, Polars finished in 43 seconds while Pandas took 628 seconds, so Polars was almost 15 times faster.
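For context, the join itself looks nearly identical in both libraries. A minimal sketch with made-up toy data (the column names id, price, and stock are illustrative, not taken from the benchmark above):

import pandas as pd
import polars as pl

left = {'id': [1, 2, 3], 'price': [9.5, 7.0, 3.2]}
right = {'id': [2, 3, 4], 'stock': [10, 0, 5]}

# pandas: inner join on the shared key
pd_result = pd.DataFrame(left).merge(pd.DataFrame(right), on='id')

# Polars: the same inner join
pl_result = pl.DataFrame(left).join(pl.DataFrame(right), on='id')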
Pandas DataFrame append() Method: the append() method appends a DataFrame-like object at the end of the current DataFrame. It returns a new DataFrame object; no changes are made to the original DataFrame.
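A minimal illustration of that return behavior (note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement):

import pandas as pd

df = pd.DataFrame({'a': [1]})
row = pd.DataFrame({'a': [2]})

df2 = df.append(row, ignore_index=True)  # returns a new DataFrame (pandas < 2.0)
print(len(df), len(df2))                 # 1 2, so df itself is unchanged

df3 = pd.concat([df, row], ignore_index=True)  # the modern equivalent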
As Mohit Motwani suggested, the fastest way is to collect the data into a list of dictionaries and then load it all into a DataFrame at once. Below are some speed measurement examples:
import pandas as pd
import numpy as np
import time
import random
end_value = 10000
Measurement for creating a list of dictionaries and, at the end, loading it all into a DataFrame:
start_time = time.time()

dictionary_list = []
for i in range(end_value):
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)

df_final = pd.DataFrame.from_dict(dictionary_list)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 0.090153 seconds
Measurements for appending data into a list and concatenating it into a DataFrame at the end:
start_time = time.time()

appended_data = []
for i in range(end_value):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A' * 30))  # 30 columns, all named 'A'
    appended_data.append(data)
appended_data = pd.concat(appended_data, axis=0)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 4.183921 seconds
Measurements for appending DataFrames one at a time:
start_time = time.time()

df_final = pd.DataFrame()
for i in range(end_value):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A' * 30))
    df_final = df_final.append(df)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 11.085888 seconds
Measurements for inserting rows with loc:
start_time = time.time()

df = pd.DataFrame(columns=list('A' * 30))
for i in range(end_value):
    df.loc[i] = list(np.random.randint(0, 100, size=30))

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time - start_time))
Execution time = 21.029176 seconds