 

How to concatenate multiple pandas.DataFrames without running into MemoryError

I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3]) 

This results in a MemoryError. How can I resolve this?

Note that most of the existing similar questions are about MemoryErrors occurring while reading large files. I don't have that problem. I have already read my files into DataFrames. I just can't concatenate that data.

asked Jun 23 '17 by bluprince13

People also ask

Can you merge multiple DataFrames in pandas at once?

Pandas' merge and concat can be used to combine subsets of a DataFrame, or even data from different files. The join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on which rows must end up in the final DataFrame.
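For instance, a minimal sketch (the key column and the frames below are made up for illustration):

import pandas as pd

# two hypothetical frames sharing a 'key' column
left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [2, 3, 4], 'b': [10, 20, 30]})

# merge on a column; how= selects left, right, inner or outer behaviour
merged = pd.merge(left, right, on='key', how='inner')

# join combines frames on the index instead
joined = left.set_index('key').join(right.set_index('key'), how='left')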

How do I merge two DataFrames in pandas without duplicates?

To concatenate DataFrames, use the concat() function; to remove the duplicate rows that result, chain the drop_duplicates() method.
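For example, a minimal sketch with two hypothetical overlapping frames:

import pandas as pd

# two hypothetical frames with an overlapping row
df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [3, 4, 5]})

# concatenate, then drop the rows that appear in both frames
combined = pd.concat([df1, df2], ignore_index=True).drop_duplicates()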

Which is faster append or concat pandas?

In this benchmark, concatenating multiple DataFrames with the pandas.concat function is 50 times faster than the DataFrame.append version. With repeated append, a new DataFrame is created at each iteration, and the underlying data is copied each time.
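As an illustration, a sketch of the two patterns (frame sizes are made up; note that DataFrame.append has since been deprecated and was removed in pandas 2.0, and repeated pd.concat calls in a loop show the same repeated-copy behaviour):

import numpy as np
import pandas as pd

# some hypothetical small frames
frames = [pd.DataFrame(np.random.rand(1000, 4)) for _ in range(100)]

# slow pattern: growing the result one frame at a time copies
# the accumulated data on every iteration
out = frames[0]
for df in frames[1:]:
    out = pd.concat([out, df], ignore_index=True)

# fast pattern: collect everything first, copy the data once
out = pd.concat(frames, ignore_index=True)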


1 Answer

The problem is, as noted in the other answers, one of memory. A solution is to store the data on disk and then build a single DataFrame from it.

With such huge data, performance is an issue.

A CSV solution is very slow, since the data has to be converted to text. An HDF5 solution is shorter, more elegant and faster, since it works in binary mode. I propose a third binary approach using pickle, which seems to be even faster, but is more technical and needs some extra room. And a fourth one, done by hand.

Here is the code:

import numpy as np
import pandas as pd
import os
import pickle

# a DataFrame factory:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))

# a csv solution
def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        # append every frame to one csv file on disk
        df.to_csv('df_all.csv', mode=md, header=hd, index=None)
        md, hd = 'a', False
    # del dfs
    # read the whole file back as a single DataFrame
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all

Better solutions:

# an HDF5 solution
def byHDF(dfs):
    store = pd.HDFStore('df_all.h5')
    for df in dfs:
        # append every frame to the same on-disk table
        store.append('df', df, data_columns=list('0123'))
    # del dfs
    # read the whole table back as a single DataFrame
    df = store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

# a pickle solution
def bypickle(dfs):
    c = []
    with open('df_all.pkl', 'ab') as f:
        for df in dfs:
            # dump every frame to disk and remember its length
            pickle.dump(df, f)
            c.append(len(df))
        # del dfs
    with open('df_all.pkl', 'rb') as f:
        # load the first frame and pre-allocate room for all the others
        df_all = pickle.load(f)
        offset = len(df_all)
        df_all = df_all.append(pd.DataFrame(np.empty(sum(c[1:]) * 4).reshape(-1, 4)))

        # load the remaining frames and write them into the pre-allocated rows
        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset + size] = df.values
            offset += size
    os.remove('df_all.pkl')
    return df_all

For homogeneous dataframes, we can do even better:

# a by-hand solution for homogeneous frames (same columns, same dtype)
def byhand(dfs):
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        for df in dfs:
            # dump the raw values of every frame to disk
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ = df.values.dtype
    # del dfs
    with open('df_all.bin', 'rb') as f:
        # read everything back as one buffer and rebuild a single DataFrame
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all

And some tests on small (32 MB) data to compare performance; you have to multiply the times by about 128 for 4 GB.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True

Of course all of that must be improved and tuned to fit your problem.

For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' to be able to run bypickle, as sketched below.
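A minimal sketch of that idea, assuming a hypothetical chunk size n_rows chosen so each piece fits in the remaining memory:

def split_into_chunks(df, n_rows):
    # yield successive row slices of df, each at most n_rows long
    for start in range(0, len(df), n_rows):
        yield df.iloc[start:start + n_rows]

# feed bypickle the two whole frames plus df3 in pieces
# (n_rows is a placeholder to be derived from the available memory)
# concat_df = bypickle([df1, df2, *split_into_chunks(df3, n_rows=10**6)])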

I can edit this answer if you give more information on your data structure and size. Beautiful question!

answered Oct 09 '22 by B. M.