I have three DataFrames that I'm trying to concatenate.
concat_df = pd.concat([df1, df2, df3])
This results in a MemoryError. How can I resolve this?
Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. I don't have that problem; I have already read my files into DataFrames. I just can't concatenate that data.
Pandas' merge and concat can be used to combine subsets of a DataFrame, or even data from different files. The join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner), depending on what data must be in the final DataFrame.
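For instance, a minimal join sketch (the frames left and right here are made up for illustration, not taken from the question):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]}).set_index('key')
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [10, 20, 30]}).set_index('key')

# join on the index; `how` controls which keys are kept
inner = left.join(right, how='inner')     # only keys present in both frames
left_join = left.join(right, how='left')  # all keys from left, NaN where right has no match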
To concatenate DataFrames, use the concat() function; to remove duplicate rows from the result, use the drop_duplicates() method.
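Applied to this question, that could look like the following sketch; the small df1/df2/df3 here are placeholders standing in for the real frames:

import pandas as pd

# placeholder frames standing in for the question's df1, df2, df3
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [2, 5], 'b': [4, 6]})
df3 = pd.DataFrame({'a': [1, 7], 'b': [3, 8]})

combined = pd.concat([df1, df2, df3], ignore_index=True)     # stack the rows
deduped = combined.drop_duplicates().reset_index(drop=True)  # remove repeated rows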
In benchmarks, concatenating multiple DataFrames with a single pandas.concat call is about 50 times faster than building the result with repeated DataFrame.append calls. With repeated append, a new DataFrame is created at each iteration, and the underlying data is copied every time.
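To illustrate the two patterns (the list of frames is made up; note that DataFrame.append was deprecated and later removed from pandas, so the slow variant is shown with concat inside the loop, which copies the accumulated data in the same way):

import numpy as np
import pandas as pd

# made-up list of frames, only to illustrate the two patterns
chunks = [pd.DataFrame(np.random.rand(1000, 4)) for _ in range(50)]

# slow: the accumulated data is re-copied on every iteration
slow = pd.DataFrame()
for chunk in chunks:
    slow = pd.concat([slow, chunk], ignore_index=True)

# fast: collect everything first, then copy once
fast = pd.concat(chunks, ignore_index=True)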
The problem is, as noted in the other answers, a memory problem. A solution is to store the data on disk, then build a single DataFrame from it.
With such huge data, performance is an issue.
CSV solutions are very slow, since conversion to text mode occurs. HDF5 solutions are shorter, more elegant, and faster, since they work in binary mode. I propose a third way in binary mode, with pickle, which seems to be even faster, but is more technical and needs some more room. And a fourth, by hand.
Here is the code:
import numpy as np
import pandas as pd
import os
import pickle

# a DataFrame factory:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))

# a csv solution
def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=None)
        md, hd = 'a', False
    #del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all
Better solutions:
def byHDF(dfs):
    store = pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df', df, data_columns=list('0123'))
    #del dfs
    df = store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c = []
    with open('df_all.pkl', 'ab') as f:
        for df in dfs:
            pickle.dump(df, f)
            c.append(len(df))
    #del dfs
    with open('df_all.pkl', 'rb') as f:
        df_all = pickle.load(f)
        offset = len(df_all)
        # pre-allocate the remaining rows (DataFrame.append was removed in recent pandas, so use pd.concat)
        df_all = pd.concat([df_all, pd.DataFrame(np.empty(sum(c[1:]) * 4).reshape(-1, 4))])
        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset + size] = df.values
            offset += size
    os.remove('df_all.pkl')
    return df_all
For homogeneous DataFrames, we can do even better:
def byhand(dfs):
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        for df in dfs:
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ = df.values.dtype
    #del dfs
    with open('df_all.bin', 'rb') as f:
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all
And some tests on small (32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.
In [92]: %time w = bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x = byHDF(dfs)
Wall time: 547 ms

In [94]: %time v = bypickle(dfs)
Wall time: 219 ms

In [95]: %time y = byhand(dfs)
Wall time: 109 ms
A check:
In [195]: (x.values == w.values).all()
Out[195]: True

In [196]: (x.values == v.values).all()
Out[196]: True

In [197]: (x.values == y.values).all()
Out[197]: True
Of course all of that must be improved and tuned to fit your problem.
For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' so that bypickle can run; a rough sketch of that idea follows.
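A hypothetical sketch of that chunking, reusing the bypickle function above; chunk_rows is a placeholder you would derive from your available memory and row size:

# split df3 into row chunks so each piece handed to bypickle() stays small
chunk_rows = 10**5  # placeholder; choose from available memory / bytes per row
df3_chunks = [df3.iloc[i:i + chunk_rows] for i in range(0, len(df3), chunk_rows)]
df_all = bypickle([df1, df2] + df3_chunks)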
I can edit this if you give more information on your data structure and size. Beautiful question!