MemoryError when I merge two Pandas data frames

Tags: python, merge, pandas, out-of-memory

I searched almost all over the internet and somehow none of the approaches seem to work in my case.

I have two large CSV files (each with over a million rows and about 300-400 MB in size). They load fine into data frames with the read_csv function, without needing the chunksize parameter. I even performed some minor operations on the data, such as generating new columns and filtering.

However, when I try to merge these two frames, I get a MemoryError. I have even tried using SQLite to do the merge, but in vain: the operation takes forever.

I am on a Windows 7 PC with 8 GB of RAM. The Python version is 2.7.

Thank you.

Edit: I tried chunking methods too. With those I don't get a MemoryError, but RAM usage explodes and my system crashes.

354

asked Nov 20 '17 06:11

Ronit Chidara

1 Answer

When you merge with pandas.merge, memory is needed for df1, for df2, and for the merged result all at once. I believe that is why you get a MemoryError. Instead, export df2 to a CSV file, read it back with the chunksize option, and merge one chunk at a time.
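To get a rough idea of how much memory each frame takes, a sketch like the one below may help. The two frames are tiny made-up stand-ins for the real data, and frame_megabytes is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def frame_megabytes(df):
    """Return the approximate in-memory size of a DataFrame in megabytes."""
    # deep=True also counts the Python string objects held in object columns.
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Made-up stand-ins for the two large frames.
df1 = pd.DataFrame({"Colname1": range(1000), "a": ["x"] * 1000})
df2 = pd.DataFrame({"Colname2": range(1000), "b": ["y"] * 1000})

merged = pd.merge(df1, df2, left_on="Colname1", right_on="Colname2")

# All three frames are alive at the same time, so their sizes add up.
for name, df in [("df1", df1), ("df2", df2), ("merged", merged)]:
    print(name, round(frame_megabytes(df), 3), "MB")
```

On the real data, the numbers for df1 and df2 alone should already hint at whether the merged frame can fit in 8 GB.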

There might be a better way, but you can try this. For large data sets you can use the chunksize option of pandas.read_csv.

    import pandas as pd

    df1 = pd.read_csv("yourdata.csv")
    df2 = pd.read_csv("yourdata2.csv")

    # Create an empty result file that holds only the header row.
    # This assumes df1 and df2 share no column names; if they do,
    # merge adds _x/_y suffixes and this header will not match.
    df_result = pd.DataFrame(columns=df1.columns.append(df2.columns).unique())
    df_result.to_csv("df3.csv", index=False)

    # Save rows that appear only in df1.
    # Sorry, I was doing a left join here; no need to run the two lines below.
    # df_result = df1[~df1.Colname1.isin(df2.Colname2)]
    # df_result.to_csv("df3.csv", mode="a", header=False, index=False)

    # Delete df2 to save memory; it is re-read from disk in chunks below.
    del df2

    def preprocess(chunk):
        # Inner-join one chunk with df1 and append the rows to the result file.
        merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
        merged.to_csv("df3.csv", mode="a", header=False, index=False)

    # The right chunksize depends on how many columns you have.
    reader = pd.read_csv("yourdata2.csv", chunksize=1000)

    for chunk in reader:
        preprocess(chunk)

This saves the merged data to df3.csv.
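The same idea can be wrapped in a function and tried on small data first. This is only a sketch under assumptions: merge_in_chunks and its parameters are hypothetical names, the sample frames are made up, and the header is taken from the first merged chunk rather than built in advance, so it also lines up when pandas adds suffixes to shared column names.

```python
import os
import tempfile

import pandas as pd


def merge_in_chunks(left, right_path, out_path, left_on, right_on, chunksize=1000):
    """Inner-join `left` with the CSV at `right_path`, one chunk at a time."""
    first = True
    rows_written = 0
    for chunk in pd.read_csv(right_path, chunksize=chunksize):
        merged = pd.merge(left, chunk, left_on=left_on, right_on=right_on)
        # Write the header only with the first chunk, then append.
        merged.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False
        rows_written += len(merged)
    return rows_written


# Tiny made-up data to exercise the function.
tmp_dir = tempfile.mkdtemp()
right_path = os.path.join(tmp_dir, "right.csv")
out_path = os.path.join(tmp_dir, "merged.csv")

left = pd.DataFrame({"Colname1": [1, 2, 3, 4], "a": ["p", "q", "r", "s"]})
right = pd.DataFrame({"Colname2": [2, 3, 4, 5, 6], "b": [20, 30, 40, 50, 60]})
right.to_csv(right_path, index=False)

n = merge_in_chunks(left, right_path, out_path, "Colname1", "Colname2", chunksize=2)
result = pd.read_csv(out_path)
print(n, list(result.columns))
```

Only the left frame and one chunk of the right file are held in memory at a time; the merged rows go straight to disk.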

132

answered Sep 29 '22 19:09

T_cat

