I have searched all over the internet, and somehow none of the approaches seems to work in my case.
I have two large CSV files (each with a million+ rows and about 300-400 MB in size). They load fine into DataFrames using the read_csv function, without having to use the chunksize parameter. I have even performed minor operations on this data, like new column generation, filtering, etc.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain. The operation takes forever.
My machine is a Windows 7 PC with 8 GB of RAM, running Python 2.7.
Thank you.
Edit: I tried chunking methods too. When I do this, I don't get a MemoryError, but RAM usage explodes and my system crashes.
The concat() function can be used to concatenate two DataFrames by appending the rows of one to the other. The merge() function is the equivalent of the SQL JOIN clause: we can use either pandas.merge() or DataFrame.merge(), and the different join types ('inner', 'left', 'right', 'outer', 'cross') are all supported. merge() also works when the join columns of the left and right frames are not unique and therefore cannot serve as an index, and it is just as efficient as join() as long as the merging is done on indexes where possible.

The main difference between join() and merge() is that join() combines two DataFrames on their indexes (not on columns), whereas merge() is primarily used to specify the columns you want to join on; merge() also supports joining on indexes or on a combination of index and columns.
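As a minimal sketch of the difference (the frames and column names below are invented for illustration):

    import pandas as pd

    left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
    right = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})

    # concat() stacks rows; columns missing on one side are filled with NaN
    stacked = pd.concat([left, right], ignore_index=True)

    # merge() joins on a column, like SQL: ... JOIN ... ON key
    inner = left.merge(right, on="key", how="inner")   # rows b and c only
    outer = left.merge(right, on="key", how="outer")   # rows a, b, c, d

    # join() aligns on the index instead of a column
    joined = left.set_index("key").join(right.set_index("key"), how="left")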
When you merge data using pandas.merge(), memory is needed for df1, df2, and the merged result all at the same time. I believe that is why you get a MemoryError. You should export df2 to a CSV file, re-read it with the chunksize option, and merge the data chunk by chunk.
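To check whether the three frames together can plausibly exceed your 8 GB, you can measure the in-memory size of the inputs before merging (a rough sketch; deep=True is needed to count string columns properly):

    import pandas as pd

    df1 = pd.read_csv("yourdata.csv")
    df2 = pd.read_csv("yourdata2.csv")

    # Bytes actually held in memory; object (string) columns are only
    # counted correctly with deep=True.
    print(df1.memory_usage(deep=True).sum())
    print(df2.memory_usage(deep=True).sum())

If df1 plus df2 already approaches your available RAM, the merged result on top of them will not fit, which matches the error you are seeing.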
There might be a better way, but you can try this. For a large dataset, use the chunksize option in pandas.read_csv:
    import pandas as pd

    df1 = pd.read_csv("yourdata.csv")
    df2 = pd.read_csv("yourdata2.csv")

    # Create an empty frame with the combined set of columns and write
    # the header row of the output file once.
    df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
    df_result.to_csv("df3.csv", index=False)

    # Delete df2 to free memory; it is re-read in chunks below.
    del df2

    def preprocess(x):
        # Merge one chunk of the second file against the whole of df1
        # and append the result to the output file.
        merged = pd.merge(df1, x, left_on="Colname1", right_on="Colname2")
        merged.to_csv("df3.csv", mode="a", header=False, index=False)

    # chunksize depends on your column count and available RAM
    reader = pd.read_csv("yourdata2.csv", chunksize=1000)
    for r in reader:
        preprocess(r)
This appends each merged chunk to df3.csv, so only df1 and one chunk of the second file need to fit in memory at a time.
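If you later need to process the merged result, read df3.csv back in chunks as well so it never has to fit in memory whole (a small sketch):

    import pandas as pd

    # Count the merged rows without loading the whole file at once.
    total = 0
    for chunk in pd.read_csv("df3.csv", chunksize=100000):
        total += len(chunk)
    print("merged rows: %d" % total)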