I have a function which processes a DataFrame, largely to sort data into buckets and create a binary matrix of features from a particular column using pd.get_dummies(df[col]).
To avoid processing all of my data with this function at once (which runs out of memory and causes IPython to crash), I have broken the large DataFrame into chunks using:

    # Integer division so that the chunk count is an int
    chunks = (len(df) // 10000) + 1
    df_list = np.array_split(df, chunks)

pd.get_dummies(df) will automatically create new columns based on the contents of df[col], and these are likely to differ for each df in df_list.
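To illustrate the point about differing columns, here is a minimal sketch with two made-up chunks (`chunk_a`, `chunk_b`, and the colour values are hypothetical, not from the question). It assumes that a category missing from a chunk should count as 0 once the pieces are concatenated:

```python
import pandas as pd

# Two hypothetical chunks whose category values only partly overlap.
chunk_a = pd.DataFrame({"col": ["red", "green"]})
chunk_b = pd.DataFrame({"col": ["green", "blue"]})

# dtype=int keeps the indicator columns numeric.
dummies_a = pd.get_dummies(chunk_a["col"], dtype=int)  # columns: green, red
dummies_b = pd.get_dummies(chunk_b["col"], dtype=int)  # columns: blue, green

# pd.concat takes the union of the columns. Cells for a column that a
# chunk never produced come back as NaN, so fill them with 0 afterwards.
combined = pd.concat([dummies_a, dummies_b], axis=0, ignore_index=True)
combined = combined.fillna(0).astype(int)
```

The `fillna(0)` step is only needed because the chunks are encoded independently; if the full set of categories is known up front, other approaches are possible.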
After processing, I am concatenating the DataFrames back together using:

    for i, df_chunk in enumerate(df_list):
        print("chunk", i)
        [x, y] = preprocess_data(df_chunk)
        # Each call copies everything accumulated so far
        super_x = pd.concat([super_x, x], axis=0)
        super_y = pd.concat([super_y, y], axis=0)
        print(datetime.datetime.utcnow())

The processing time of the first chunk is perfectly acceptable, but it grows with every chunk! This should not be caused by preprocess_data(df_chunk), as there is no reason for its running time to increase. Is this increase in time occurring as a result of the call to pd.concat()?
Please see log below:

    chunks 6
    chunk 0
    2016-04-08 00:22:17.728849
    chunk 1
    2016-04-08 00:22:42.387693
    chunk 2
    2016-04-08 00:23:43.124381
    chunk 3
    2016-04-08 00:25:30.249369
    chunk 4
    2016-04-08 00:28:11.922305
    chunk 5
    2016-04-08 00:32:00.357365

Is there a workaround to speed this up? I have 2900 chunks to process so any help is appreciated!
Open to any other suggestions in Python!
Pandas DataFrames are fantastic. However, concatenating them repeatedly with the standard approaches, such as pandas.concat(), can be very slow with large DataFrames.
Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame with its final size in one go.
DataFrame.append adds the rows of a second DataFrame to the first and returns a new DataFrame. With multiple append calls, a new DataFrame is created at each iteration and the underlying data is copied each time. pd.concat can combine a whole list of DataFrames in a single operation, which makes it faster than repeated append() calls; one reported figure is about 50 times faster. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames has to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

    super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 2 | 2 | 1 | 3 |
| ... | | | |
| N-1 | N-1 | 1 | N |

The total is 1 + 2 + 3 + ... + N = N(N+1)/2, so O(N**2) copies are required to complete the loop.
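The arithmetic can be checked with a small counting sketch. It does not touch pandas at all; the two helper functions are hypothetical names introduced here, and they only tally how many size-1 pieces would be copied under each strategy, using the same assumption as the table:

```python
def copies_concat_in_loop(n):
    """Pieces copied when one size-1 piece is concatenated onto the result n times."""
    total = 0
    size = 0
    for _ in range(n):
        size += 1      # the new result holds the old pieces plus one new piece
        total += size  # every piece in the new result has to be copied
    return total


def copies_concat_once(n):
    """Pieces copied when all n size-1 pieces are concatenated in a single call."""
    return n


n_chunks = 2900  # the number of chunks mentioned in the question
loop_copies = copies_concat_in_loop(n_chunks)
single_copies = copies_concat_once(n_chunks)
```

For the 2900 chunks in the question, `loop_copies` equals 2900 * 2901 / 2, a little over 4.2 million piece copies, against 2900 for a single concatenation.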
Now consider

    super_x = []
    for i, df_chunk in enumerate(df_list):
        [x, y] = preprocess_data(df_chunk)
        super_x.append(x)
    super_x = pd.concat(super_x, axis=0)

Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
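For completeness, here is a self-contained sketch of the same pattern that collects both outputs. The `preprocess_data` body, the column names `col` and `target`, and the tiny example frame are all placeholders made up for illustration; the question does not show the real function:

```python
import pandas as pd


def preprocess_data(df_chunk):
    """Hypothetical stand-in for the question's function: returns (features, target)."""
    x = pd.get_dummies(df_chunk["col"], dtype=int)
    y = df_chunk["target"]
    return x, y


# Small made-up frame; the real one is far larger.
df = pd.DataFrame({"col": list("abcabd"), "target": range(6)})

# Positional slices are one simple way to chunk a DataFrame.
chunk_size = 2
df_list = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Accumulate the pieces in plain lists, then concatenate once per output.
x_parts, y_parts = [], []
for df_chunk in df_list:
    x, y = preprocess_data(df_chunk)
    x_parts.append(x)
    y_parts.append(y)

# Chunks may yield different dummy columns; missing ones are filled with 0.
super_x = pd.concat(x_parts, axis=0).fillna(0).astype(int)
super_y = pd.concat(y_parts, axis=0)
```

Keeping separate lists for `x` and `y` means each output is copied once at the end, rather than once per chunk.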