I have three DataFrames that I'm trying to concatenate.
concat_df = pd.concat([df1, df2, df3])
This results in a MemoryError. How can I resolve this?
Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. I don't have that problem; I have already read my files into DataFrames. I just can't concatenate that data.
Pandas' merge and concat can be used to combine subsets of a DataFrame, or even data from different files. The join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner), depending on what data must be in the final DataFrame.
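For instance, a minimal join sketch (the frames left and right here are made up for illustration, not taken from the question):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]}).set_index('key')
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [10, 20, 30]}).set_index('key')

# join on the index; `how` controls which keys are kept
inner = left.join(right, how='inner')     # only keys present in both frames
left_join = left.join(right, how='left')  # all keys from left, NaN where right has no match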
To concatenate DataFrames, use the concat() function; to remove duplicate rows from the result, use the drop_duplicates() method.
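Applied to this question, that could look like the following sketch; the small df1/df2/df3 here are placeholders standing in for the real frames:

import pandas as pd

# placeholder frames standing in for the question's df1, df2, df3
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [2, 5], 'b': [4, 6]})
df3 = pd.DataFrame({'a': [1, 7], 'b': [3, 8]})

combined = pd.concat([df1, df2, df3], ignore_index=True)     # stack the rows
deduped = combined.drop_duplicates().reset_index(drop=True)  # remove repeated rows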
In benchmarks, concatenating multiple DataFrames with a single pandas.concat call is about 50 times faster than building the result with repeated DataFrame.append calls. With repeated append, a new DataFrame is created at each iteration, and the underlying data is copied every time.
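To illustrate the two patterns (the list of frames is made up; note that DataFrame.append was deprecated and later removed from pandas, so the slow variant is shown with concat inside the loop, which copies the accumulated data in the same way):

import numpy as np
import pandas as pd

# made-up list of frames, only to illustrate the two patterns
chunks = [pd.DataFrame(np.random.rand(1000, 4)) for _ in range(50)]

# slow: the accumulated data is re-copied on every iteration
slow = pd.DataFrame()
for chunk in chunks:
    slow = pd.concat([slow, chunk], ignore_index=True)

# fast: collect everything first, then copy once
fast = pd.concat(chunks, ignore_index=True)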
The problem is, as noted in the other answers, a memory problem. A solution is to store the data on disk, then build a single DataFrame from it.
With such huge data, performance is an issue.
CSV solutions are very slow, since conversion to text mode occurs. HDF5 solutions are shorter, more elegant, and faster, since they work in binary mode. I propose a third way in binary mode, with pickle, which seems to be even faster, but is more technical and needs some more room. And a fourth, by hand.
Here is the code:
import numpy as np
import pandas as pd
import os
import pickle

# a DataFrame factory:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))

# a csv solution
def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=None)
        md, hd = 'a', False
    #del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all
Better solutions:
def byHDF(dfs):
    store = pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df', df, data_columns=list('0123'))
    #del dfs
    df = store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c = []
    with open('df_all.pkl', 'ab') as f:
        for df in dfs:
            pickle.dump(df, f)
            c.append(len(df))
    #del dfs
    with open('df_all.pkl', 'rb') as f:
        df_all = pickle.load(f)
        offset = len(df_all)
        # pre-allocate the remaining rows (DataFrame.append was removed in recent pandas, so use pd.concat)
        df_all = pd.concat([df_all, pd.DataFrame(np.empty(sum(c[1:]) * 4).reshape(-1, 4))])
        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset + size] = df.values
            offset += size
    os.remove('df_all.pkl')
    return df_all
For homogeneous DataFrames, we can do even better:
def byhand(dfs):
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        for df in dfs:
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ = df.values.dtype
    #del dfs
    with open('df_all.bin', 'rb') as f:
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all
And some tests on small (32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.
In [92]: %time w = bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x = byHDF(dfs)
Wall time: 547 ms

In [94]: %time v = bypickle(dfs)
Wall time: 219 ms

In [95]: %time y = byhand(dfs)
Wall time: 109 ms
A check:
In [195]: (x.values == w.values).all()
Out[195]: True

In [196]: (x.values == v.values).all()
Out[196]: True

In [197]: (x.values == y.values).all()
Out[197]: True
Of course all of that must be improved and tuned to fit your problem.
For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' so that bypickle can run; a rough sketch of that idea follows.
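A hypothetical sketch of that chunking, reusing the bypickle function above; chunk_rows is a placeholder you would derive from your available memory and row size:

# split df3 into row chunks so each piece handed to bypickle() stays small
chunk_rows = 10**5  # placeholder; choose from available memory / bytes per row
df3_chunks = [df3.iloc[i:i + chunk_rows] for i in range(0, len(df3), chunk_rows)]
df_all = bypickle([df1, df2] + df3_chunks)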
I can edit this if you give more information on your data structure and size. Beautiful question!