I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every append operation.
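For reference, here is a minimal sketch of the NumPy comparison I have in mind (sizes chosen to match the benchmark below; this is just an illustration, not part of the timed code):

import numpy as np
from timeit import timeit

chunks = [np.random.rand(2500, 40) for _ in range(10)]

def fill_preallocated():
    # Grab all the memory once, then write each chunk into its slot
    out = np.empty((2500 * 10, 40))
    for i, c in enumerate(chunks):
        out[i * 2500:(i + 1) * 2500, :] = c
    return out

def grow_by_append():
    # np.append reallocates and copies the whole array on every call
    out = np.empty((0, 40))
    for c in chunks:
        out = np.append(out, c, axis=0)
    return out

# Preallocating should come out clearly ahead; exact numbers will vary by machine.
# timeit(fill_preallocated, number=100)
# timeit(grow_by_append, number=100)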
In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.
Here is an example with timing. The definition of the Timer class follows below. As you see, I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.
import numpy as np
from numpy.random import rand
import pandas as pd
from timer import Timer

# Some constants
num_dfs = 10    # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100    # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols))
           for i in np.arange(num_dfs)]

##
# Define two methods of growing a large dataframe
##

# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(4))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

def method2():
    # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols),
                           index=np.arange(num_dfs*n_rows))
    # EDIT_1: Set the dtypes of each column
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2

# EDIT_2:
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt)
                                  for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time both methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))
#>> Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>> Best time: 0.028532
#>> Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))
#>> Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>> Best time: 0.315075
#>> Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))
#>> Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>> Best time: 0.063437
#>> Mean time: 0.065770
I use a nice Timer courtesy of Huy Nguyen:
# credit: http://www.huyng.com/posts/python-performance-analysis/
import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs
If you are still following, I have two questions:
1) Why is the append method faster? (NOTE: for very small dataframes, e.g. n_rows = 40, it is actually slower.)
2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).
Thanks for your help!
EDIT_1: In my real-world project, the columns have different dtypes, so I cannot use the pd.DataFrame(..., dtype=some_type) trick (per BrenBarn's recommendation) to improve the performance of preallocation: the dtype parameter forces all the columns to be the same dtype [Ref. issue 4464].
I added some lines to method2() in my code to change the dtypes column-by-column to match the input dataframes. This operation is expensive and negates the benefit of having the appropriate dtypes when writing blocks of rows.
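To illustrate the underlying problem, here is a minimal sketch (not part of the benchmark, with made-up column names): a single dtype= applies to every column of the new frame, so mixed-dtype frames are stuck with the per-column astype loop that method2() now does.

import numpy as np
import pandas as pd

# dtype= is one value that applies to every column of the new frame
out = pd.DataFrame(index=np.arange(5), columns=['floats', 'strings'], dtype=np.float64)
print out.dtypes
# floats     float64
# strings    float64   <-- also float64; there is no per-column dtype argument here

# With mixed dtypes, the columns have to be cast one by one after construction,
# which is the expensive loop added to method2() above.
out['strings'] = out['strings'].astype(object)
print out.dtypes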
EDIT_2: Per @Joris's suggestion, I tried preallocating a dataframe from placeholder np.empty(..., dtype=some_type) arrays (method3() above).
Your benchmark is actually too small to show the real difference. Appending copies EACH time, so over N appends you end up copying roughly N*(N-1)/2 chunk-sized blocks of memory in total. This is horribly inefficient as the size of your dataframe grows. It certainly might not matter for a very small frame, but if you have any real size this matters a lot. This is specifically noted in the docs, though it is kind of a small warning.
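A quick back-of-the-envelope sketch of that quadratic blow-up (a rough model that just counts chunk-sized copies, ignoring allocation overhead):

def blocks_copied_by_append(n_chunks):
    # Building up a result one chunk at a time: the i-th append copies the
    # accumulated result (i blocks) plus the new chunk
    return sum(i + 1 for i in range(n_chunks))

def blocks_copied_by_concat(n_chunks):
    # concat copies every chunk exactly once into the final frame
    return n_chunks

print blocks_copied_by_append(10), blocks_copied_by_concat(10)      # 55 vs 10
print blocks_copied_by_append(500), blocks_copied_by_concat(500)    # 125250 vs 500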
In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB
Appending
In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....:
Concat
In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....:

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop
Note that I wouldn't even bother trying to pre-allocate. It's somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply assign into it with .loc and it would work; a rough sketch follows after the timings below). But pd.concat is just dead simple, works reliably, and is fast.
And the timings at your sizes from above:
In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
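For completeness, here is a minimal sketch of that pre-allocate-and-.loc approach, assuming equal-sized chunks with identical columns (illustrative only; the dtype fix-up at the end is exactly the complication I'd rather avoid):

import numpy as np
import pandas as pd

# Hypothetical equal-sized chunks with identical, known columns
chunks = [pd.DataFrame(np.random.randn(2500, 40)) for _ in range(10)]
n_per_chunk = len(chunks[0])

# Pre-allocate the full index up front, then write each chunk into its slot with .loc
big = pd.DataFrame(index=np.arange(n_per_chunk * len(chunks)),
                   columns=chunks[0].columns)
for i, chunk in enumerate(chunks):
    start = i * n_per_chunk
    big.loc[start:start + n_per_chunk - 1, :] = chunk.values  # .loc slices are end-inclusive

# The pre-allocated columns typically come out as object dtype,
# so they still need casting column by column afterwards
for ix, col in enumerate(big.columns):
    big[col] = big[col].astype(chunks[0].dtypes[ix])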
@Jeff, pd.concat wins by a mile! I benchmarked a fourth method using pd.concat with num_dfs = 500. The results are unequivocal:
The method4() definition:
# Method 4 - use pd.concat on df_list
def method4():
    return pd.concat(df_list, ignore_index=True)
Profiling results, using the same Timer as in my original question:
Total time for 100 repetitions of Method 1: 3679.334655 [sec]
Best time: 35.570036
Mean time: 36.793347

Total time for 100 repetitions of Method 2: 1569.917425 [sec]
Best time: 15.457102
Mean time: 15.699174

Total time for 100 repetitions of Method 3: 325.730455 [sec]
Best time: 3.192702
Mean time: 3.257305

Total time for 100 repetitions of Method 4: 25.448473 [sec]
Best time: 0.244309
Mean time: 0.254485
The pd.concat method is 13x faster than preallocating with an np.empty(..., dtype) placeholder.
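So for question 2 in my original post (my chunks are really large CSV files), the pattern I am settling on is: read each file into its own dataframe, collect them in a list, and call pd.concat exactly once at the end, never append in a loop. A minimal sketch (the file names here are hypothetical):

import glob
import pandas as pd

# Hypothetical input files; in reality these are my large CSVs
csv_files = glob.glob('data/chunk_*.csv')

# Read every chunk first, then concatenate once
df_list = [pd.read_csv(f) for f in csv_files]
big_df = pd.concat(df_list, ignore_index=True)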