Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MemoryError when I merge two Pandas data frames

I searched almost all over the internet and somehow none of the approaches seem to work in my case.

I have two large csv files (each with a million+ rows and about 300-400MB in size). They are loading fine into data frames using the read_csv function without having to use the chunksize parameter. I even performed certain minor operations on this data like new column generation, filtering, etc.

However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain. The operation takes forever.

Mine is a Windows 7 PC with 8GB RAM. The Python version is 2.7

Thank you.

Edit: I tried chunking methods too. When I do this, I don't get MemoryError, but the RAM usage explodes and my system crashes.

like image 354
Ronit Chidara Avatar asked Nov 20 '17 06:11

Ronit Chidara


People also ask

How do I merge two data frames in Pandas?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.

Can you merge more than 2 DataFrames in Pandas?

We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes. Merging multiple Dataframes is similar to SQL join and supports different types of join inner , left , right , outer , cross .

Is Pandas merge efficient?

Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as: Merging is done on indexes if possible.

What is difference between joining and merging in Pandas DataFrame?

Pandas Join vs Merge Differences The main difference between join vs merge would be; join() is used to combine two DataFrames on the index but not on columns whereas merge() is primarily used to specify the columns you wanted to join on, this also supports joining on indexes and combination of index and columns.


Video Answer


1 Answers

When you are merging data using pandas.merge it will use df1 memory, df2 memory and merge_df memory. I believe that it is why you get a memory error. You should export df2 to a csv file and use chunksize option and merge data.

It might be a better way but you can try this. *for large data set you can use chunksize option in pandas.read_csv

df1 = pd.read_csv("yourdata.csv") df2 = pd.read_csv("yourdata2.csv") df2_key = df2.Colname2  # creating a empty bucket to save result df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique()) df_result.to_csv("df3.csv",index_label=False)  # save data which only appear in df1 # sorry I was doing left join here. no need to run below two line. # df_result = df1[df1.Colname1.isin(df2.Colname2)!=True] # df_result.to_csv("df3.csv",index_label=False, mode="a")  # deleting df2 to save memory del(df2)  def preprocess(x):     df2=pd.merge(df1,x, left_on = "Colname1", right_on = "Colname2")     df2.to_csv("df3.csv",mode="a",header=False,index=False)  reader = pd.read_csv("yourdata2.csv", chunksize=1000) # chunksize depends with you colsize  [preprocess(r) for r in reader] 

this will save merged data as df3.

like image 132
T_cat Avatar answered Sep 29 '22 19:09

T_cat