 

Creating large Pandas DataFrames: preallocation vs append vs concat

Tags: python, pandas

I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every append operation.
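(For reference, a minimal sketch of the NumPy pattern being described, with arbitrary sizes and not taken from the question: preallocate-and-fill versus growing an array with repeated np.append.)

import numpy as np

n_chunks, chunk_len = 100, 10_000

# Preallocate once, then fill slices in place: no reallocation per chunk.
out = np.empty(n_chunks * chunk_len)
for i in range(n_chunks):
    out[i * chunk_len:(i + 1) * chunk_len] = np.random.rand(chunk_len)

# Growing with np.append reallocates and copies the whole array on every pass.
grown = np.empty(0)
for i in range(n_chunks):
    grown = np.append(grown, np.random.rand(chunk_len))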

In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.

Here is an example with timing (the definition of the Timer class follows below). As you can see, I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.

import numpy as np
from numpy.random import rand
import pandas as pd

from timer import Timer

# Some constants
num_dfs = 10  # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100  # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)),
                        columns=np.arange(n_cols))
           for i in np.arange(num_dfs)]

##
# Define two methods of growing a large dataframe
##

# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(4))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

def method2():
    # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols),
                           index=np.arange(num_dfs*n_rows))
    # EDIT_1: Set the dtypes of each column
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2

# EDIT_2:
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt)
                                  for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time both methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))

#>>  Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>>  Best time: 0.028532
#>>  Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))

#>>  Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>>  Best time: 0.315075
#>>  Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))

#>>  Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>>  Best time: 0.063437
#>>  Mean time: 0.065770

I use a nice Timer courtesy of Huy Nguyen:

# credit: http://www.huyng.com/posts/python-performance-analysis/

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs
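(Side note, not part of the original question: time.clock() was removed in Python 3.8. On Python 3, a roughly equivalent timer can be sketched with time.perf_counter():)

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        # perf_counter() is the recommended monotonic clock for timing on Python 3
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # milliseconds
        if self.verbose:
            print('elapsed time: %f ms' % self.msecs)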

If you are still following, I have two questions:

1) Why is the append method faster? (NOTE: for very small dataframes, e.g. n_rows = 40, it is actually slower.)

2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large CSV files.)

Thanks for your help!

EDIT_1: In my real-world project, the columns have different dtypes, so I cannot use the pd.DataFrame(..., dtype=some_type) trick to improve the performance of preallocation, per BrenBarn's recommendation: the dtype parameter forces all the columns to be the same dtype [Ref. issue 4464].

I added some lines to method2() in my code to change the dtypes column-by-column to match the input dataframes. This operation is expensive and negates the benefit of having the appropriate dtypes when writing blocks of rows.
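(A possible workaround, sketched here and not part of the original post: build the preallocated frame from a dict of typed empty arrays so every column gets its own dtype in one step, reusing df_list, num_dfs and n_rows from the snippet above. Whether the block assignment preserves the per-column dtypes should be verified for genuinely mixed-dtype data.)

dtypes = df_list[0].dtypes
# Each column is created with its own dtype up front; dict order is preserved
# on Python 3.7+, so the column order matches df_list[0].
out = pd.DataFrame({col: np.empty(num_dfs * n_rows, dtype=dt)
                    for col, dt in dtypes.items()})
# Fill in blocks of rows, as in method2()/method3() above.
for ix, df in enumerate(df_list):
    out.iloc[ix * n_rows:(ix + 1) * n_rows, :] = df.values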

EDIT_2: Try preallocating a dataframe using a placeholder array of np.empty(..., dtype=some_type), per @Joris's suggestion.

asked Jul 29 '15 by andrew




2 Answers

Your benchmark is actually too small to show the real difference. Appending copies the data EACH time, so for a final frame of size N you are really copying on the order of N*(N-1) elements' worth of memory. This is horribly inefficient as the size of your dataframe grows. It may not matter for a very small frame, but for any real size it matters a lot. This is specifically noted in the docs here, though only as a small warning.

In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB

Appending

In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....:

Concat

In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....:

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop

Note that I wouldn't even bother trying to pre-allocate. It's somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply assign into it with .loc, and it would work). But pd.concat is just dead simple, works reliably, and is fast.

And here is the timing with your sizes from above:

In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
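Since the chunks in the original question come from large CSV files, the same collect-then-concatenate idea might look like the sketch below (the glob pattern is a placeholder, not a path from the question):

import glob
import pandas as pd

# Read each chunk once, keep the pieces in a plain Python list,
# then concatenate a single time.
chunks = [pd.read_csv(path) for path in sorted(glob.glob('data/*.csv'))]
big_df = pd.concat(chunks, ignore_index=True)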
answered by Jeff


@Jeff, pd.concat wins by a mile! I benchmarked a fourth method using pd.concat with num_dfs = 500. The results are unequivocal:

The method4() definition:

# Method 4 - use pd.concat on df_list
def method4():
    return pd.concat(df_list, ignore_index=True)

Profiling results, using the same Timer as in my original question:

Total time for 100 repetitions of Method 1: 3679.334655 [sec]
Best time: 35.570036
Mean time: 36.793347
Total time for 100 repetitions of Method 2: 1569.917425 [sec]
Best time: 15.457102
Mean time: 15.699174
Total time for 100 repetitions of Method 3: 325.730455 [sec]
Best time: 3.192702
Mean time: 3.257305
Total time for 100 repetitions of Method 4: 25.448473 [sec]
Best time: 0.244309
Mean time: 0.254485

The pd.concat method is roughly 13x faster than preallocating with an np.empty(..., dtype) placeholder.
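(A note for readers on current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so of the methods benchmarked here only the pd.concat pattern still runs unchanged.)

# On pandas >= 2.0, Method 1 no longer works (DataFrame.append is gone);
# Method 4 is unaffected:
big_df = pd.concat(df_list, ignore_index=True)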

answered by andrew