I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every append operation.
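For reference, here is a minimal sketch of the NumPy comparison I have in mind (sizes chosen to match the benchmark below; this is just an illustration, not part of the timed code):

import numpy as np
from timeit import timeit

chunks = [np.random.rand(2500, 40) for _ in range(10)]

def fill_preallocated():
    # Grab all the memory once, then write each chunk into its slot
    out = np.empty((2500 * 10, 40))
    for i, c in enumerate(chunks):
        out[i * 2500:(i + 1) * 2500, :] = c
    return out

def grow_by_append():
    # np.append reallocates and copies the whole array on every call
    out = np.empty((0, 40))
    for c in chunks:
        out = np.append(out, c, axis=0)
    return out

# Preallocating should come out clearly ahead; exact numbers will vary by machine.
# timeit(fill_preallocated, number=100)
# timeit(grow_by_append, number=100)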
In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.
Here is an example with timing. The definition of the Timer class follows below. As you see, I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.
import numpy as np
from numpy.random import rand
import pandas as pd
from timer import Timer

# Some constants
num_dfs = 10    # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100    # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols))
           for i in np.arange(num_dfs)]

##
# Define two methods of growing a large dataframe
##

# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(4))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

def method2():
    # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols),
                           index=np.arange(num_dfs*n_rows))
    # EDIT_1: Set the dtypes of each column
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2

# EDIT_2:
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt)
                                  for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time both methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))
#>> Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>> Best time: 0.028532
#>> Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))
#>> Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>> Best time: 0.315075
#>> Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))
#>> Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>> Best time: 0.063437
#>> Mean time: 0.065770
I use a nice Timer courtesy of Huy Nguyen:
# credit: http://www.huyng.com/posts/python-performance-analysis/
import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs
If you are still following, I have two questions:
1) Why is the append method faster? (NOTE: for very small dataframes, e.g. n_rows = 40, it is actually slower.)
2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).
Thanks for your help!
EDIT_1: In my real-world project, the columns have different dtypes, so I cannot use the pd.DataFrame(..., dtype=some_type) trick (per BrenBarn's recommendation) to improve the performance of preallocation: the dtype parameter forces all the columns to be the same dtype [Ref. issue 4464].
I added some lines to method2() in my code to change the dtypes column-by-column to match the input dataframes. This operation is expensive and negates the benefit of having the appropriate dtypes when writing blocks of rows.
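To illustrate the underlying problem, here is a minimal sketch (not part of the benchmark, with made-up column names): a single dtype= applies to every column of the new frame, so mixed-dtype frames are stuck with the per-column astype loop that method2() now does.

import numpy as np
import pandas as pd

# dtype= is one value that applies to every column of the new frame
out = pd.DataFrame(index=np.arange(5), columns=['floats', 'strings'], dtype=np.float64)
print out.dtypes
# floats     float64
# strings    float64   <-- also float64; there is no per-column dtype argument here

# With mixed dtypes, the columns have to be cast one by one after construction,
# which is the expensive loop added to method2() above.
out['strings'] = out['strings'].astype(object)
print out.dtypes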
EDIT_2: Per @Joris's suggestion, I tried preallocating a dataframe from placeholder np.empty(..., dtype=some_type) arrays (method3() above).
Your benchmark is actually too small to show the real difference. Appending copies EACH time, so over N appends you end up copying roughly N*(N-1)/2 chunk-sized blocks of memory in total. This is horribly inefficient as the size of your dataframe grows. It certainly might not matter for a very small frame, but if you have any real size this matters a lot. This is specifically noted in the docs, though it is kind of a small warning.
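A quick back-of-the-envelope sketch of that quadratic blow-up (a rough model that just counts chunk-sized copies, ignoring allocation overhead):

def blocks_copied_by_append(n_chunks):
    # Building up a result one chunk at a time: the i-th append copies the
    # accumulated result (i blocks) plus the new chunk
    return sum(i + 1 for i in range(n_chunks))

def blocks_copied_by_concat(n_chunks):
    # concat copies every chunk exactly once into the final frame
    return n_chunks

print blocks_copied_by_append(10), blocks_copied_by_concat(10)      # 55 vs 10
print blocks_copied_by_append(500), blocks_copied_by_concat(500)    # 125250 vs 500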
In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB
Appending
In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....:
Concat
In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....:

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop
Note that I wouldn't even bother trying to pre-allocate. It's somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply assign into it with .loc and it would work; a rough sketch follows after the timings below). But pd.concat is just dead simple, works reliably, and is fast.
And the timings at your sizes from above:
In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
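For completeness, here is a minimal sketch of that pre-allocate-and-.loc approach, assuming equal-sized chunks with identical columns (illustrative only; the dtype fix-up at the end is exactly the complication I'd rather avoid):

import numpy as np
import pandas as pd

# Hypothetical equal-sized chunks with identical, known columns
chunks = [pd.DataFrame(np.random.randn(2500, 40)) for _ in range(10)]
n_per_chunk = len(chunks[0])

# Pre-allocate the full index up front, then write each chunk into its slot with .loc
big = pd.DataFrame(index=np.arange(n_per_chunk * len(chunks)),
                   columns=chunks[0].columns)
for i, chunk in enumerate(chunks):
    start = i * n_per_chunk
    big.loc[start:start + n_per_chunk - 1, :] = chunk.values  # .loc slices are end-inclusive

# The pre-allocated columns typically come out as object dtype,
# so they still need casting column by column afterwards
for ix, col in enumerate(big.columns):
    big[col] = big[col].astype(chunks[0].dtypes[ix])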
@Jeff, pd.concat wins by a mile! I benchmarked a fourth method using pd.concat with num_dfs = 500. The results are unequivocal:
The method4() definition:
# Method 4 - use pd.concat on df_list
def method4():
    return pd.concat(df_list, ignore_index=True)
Profiling results, using the same Timer as in my original question:
Total time for 100 repetitions of Method 1: 3679.334655 [sec]
Best time: 35.570036
Mean time: 36.793347

Total time for 100 repetitions of Method 2: 1569.917425 [sec]
Best time: 15.457102
Mean time: 15.699174

Total time for 100 repetitions of Method 3: 325.730455 [sec]
Best time: 3.192702
Mean time: 3.257305

Total time for 100 repetitions of Method 4: 25.448473 [sec]
Best time: 0.244309
Mean time: 0.254485
The pd.concat method is 13x faster than preallocating with an np.empty(..., dtype) placeholder.
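So for question 2 in my original post (my chunks are really large CSV files), the pattern I am settling on is: read each file into its own dataframe, collect them in a list, and call pd.concat exactly once at the end, never append in a loop. A minimal sketch (the file names here are hypothetical):

import glob
import pandas as pd

# Hypothetical input files; in reality these are my large CSVs
csv_files = glob.glob('data/chunk_*.csv')

# Read every chunk first, then concatenate once
df_list = [pd.read_csv(f) for f in csv_files]
big_df = pd.concat(df_list, ignore_index=True)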