Why does concatenation of DataFrames get exponentially slower?

Tags: performance, python, concatenation, pandas, processing-efficiency

I have a function which processes a DataFrame, largely to bucket the data and create a binary matrix of features from a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data using this function at once (which runs out of memory and causes IPython to crash), I have broken the large DataFrame into chunks using:

    chunks = (len(df) / 10000) + 1
    df_list = np.array_split(df, chunks)

pd.get_dummies(df) will automatically create new columns based on the contents of df[col] and these are likely to differ for each df in df_list.
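To illustrate the point about differing columns, here is a minimal sketch with two made-up chunks (the column name `col` and the values are assumptions for illustration only). It shows that pd.concat takes the union of the columns and leaves NaN where a chunk never produced a given column:

```python
import pandas as pd

# Two hypothetical chunks whose category values only partly overlap.
chunk_a = pd.DataFrame({"col": ["red", "green"]})
chunk_b = pd.DataFrame({"col": ["green", "blue"]})

# dtype=int keeps the indicator columns numeric across pandas versions.
dummies_a = pd.get_dummies(chunk_a["col"], dtype=int)  # columns: green, red
dummies_b = pd.get_dummies(chunk_b["col"], dtype=int)  # columns: blue, green

# pd.concat aligns on column names; a column missing from one chunk is
# filled with NaN for that chunk's rows, so replace NaN with 0 afterwards.
combined = pd.concat([dummies_a, dummies_b], axis=0, ignore_index=True)
combined = combined.fillna(0).astype(int)
print(combined)
```

The fillna(0) step is one possible way to restore a clean 0/1 matrix after the chunks are joined; whether 0 is the right fill value depends on what the downstream code expects.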

After processing, I am concatenating the DataFrames back together using:

    for i, df_chunk in enumerate(df_list):
        print "chunk", i
        [x, y] = preprocess_data(df_chunk)
        super_x = pd.concat([super_x, x], axis=0)
        super_y = pd.concat([super_y, y], axis=0)
        print datetime.datetime.utcnow()

The processing time of the first chunk is perfectly acceptable, but it grows with every subsequent chunk! This should not be caused by preprocess_data(df_chunk), as there is no reason for its run time to increase. Is the increase occurring as a result of the call to pd.concat()?

Please see log below:

    chunks 6
    chunk 0
    2016-04-08 00:22:17.728849
    chunk 1
    2016-04-08 00:22:42.387693
    chunk 2
    2016-04-08 00:23:43.124381
    chunk 3
    2016-04-08 00:25:30.249369
    chunk 4
    2016-04-08 00:28:11.922305
    chunk 5
    2016-04-08 00:32:00.357365
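To make the growth in this log explicit, the gaps between consecutive timestamps can be computed. The sketch below uses only the six timestamps shown in the log and assumes each one is printed when its chunk finishes, so each gap is the time spent on the following chunk. The gaps rise from roughly 25 seconds to roughly 228 seconds:

```python
from datetime import datetime

# Timestamps copied from the log; one is printed at the end of each chunk.
stamps = [
    "2016-04-08 00:22:17.728849",
    "2016-04-08 00:22:42.387693",
    "2016-04-08 00:23:43.124381",
    "2016-04-08 00:25:30.249369",
    "2016-04-08 00:28:11.922305",
    "2016-04-08 00:32:00.357365",
]
times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]

# Elapsed seconds between consecutive timestamps, i.e. for chunks 1 to 5.
durations = [(later - earlier).total_seconds()
             for earlier, later in zip(times, times[1:])]

for i, seconds in enumerate(durations, start=1):
    print("chunk %d took %.1f s" % (i, seconds))
```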

Is there a workaround to speed this up? I have 2900 chunks to process so any help is appreciated!

Open to any other suggestions in Python!

asked Apr 08 '16 00:04

jfive

1 Answer

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

    super_x = pd.concat([super_x, x], axis=0)

    | iteration | size of old super_x | size of x | copying required |
    |         0 |                   0 |         1 |                1 |
    |         1 |                   1 |         1 |                2 |
    |         2 |                   2 |         1 |                3 |
    |       ... |                     |           |                  |
    |       N-1 |                 N-1 |         1 |                N |

1 + 2 + 3 + ... + N = N(N+1)/2. So O(N**2) copies are required to complete the loop.
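As a quick check of that sum, the copy count can be tallied directly. This is only a counting model under the same assumption as the table (each x has size 1); the function name is made up for illustration, and the value 2900 is the number of chunks mentioned in the question:

```python
def copies_concat_in_loop(n_chunks):
    """Count unit-sized copies when pd.concat runs on every iteration."""
    total = 0
    for i in range(n_chunks):
        # Iteration i copies the i units already accumulated plus 1 new unit.
        total += i + 1
    return total

for n in (6, 2900):
    counted = copies_concat_in_loop(n)
    closed_form = n * (n + 1) // 2
    print(n, counted, closed_form)
```

For 2900 chunks the model gives 4,206,450 unit copies, compared with 2900 if every unit were copied only once.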

Now consider

    super_x = []
    for i, df_chunk in enumerate(df_list):
        [x, y] = preprocess_data(df_chunk)
        super_x.append(x)
    super_x = pd.concat(super_x, axis=0)

Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
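Here is a self-contained sketch of the same pattern applied to both outputs, x and y. The preprocess_data body, the column names `col` and `target`, and the synthetic data are all assumptions standing in for the question's real function and data; the chunks are cut with plain positional slicing rather than np.array_split to keep the example simple:

```python
import numpy as np
import pandas as pd

def preprocess_data(df_chunk):
    """Hypothetical stand-in for the question's function: returns (x, y)."""
    x = pd.get_dummies(df_chunk["col"], dtype=int)
    y = df_chunk[["target"]]
    return x, y

# Small synthetic frame; the column names are made up for this sketch.
df = pd.DataFrame({"col": list("abcabcabcd"), "target": range(10)})

# Cut the frame into 3 roughly equal pieces by row position.
n_chunks = 3
bounds = np.linspace(0, len(df), n_chunks + 1).astype(int)
df_list = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

x_parts, y_parts = [], []
for df_chunk in df_list:
    x, y = preprocess_data(df_chunk)
    x_parts.append(x)  # stores a reference only; no DataFrame data is copied
    y_parts.append(y)

# One concat per output after the loop. Dummy columns that a chunk never
# produced come back as NaN, so they are filled with 0 here.
super_x = pd.concat(x_parts, axis=0).fillna(0).astype(int)
super_y = pd.concat(y_parts, axis=0)

print(super_x.shape, super_y.shape)
```

Keeping two lists and concatenating each once means the cost of joining the results grows linearly with the number of chunks for both super_x and super_y.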

answered Sep 21 '22 04:09

unutbu
