 

Pandas HDFStore unload dataframe from memory

OK, I am experimenting with pandas to load a roughly 30GB CSV file, with 40 million+ rows and 150+ columns, into HDFStore. The majority of the columns are strings, followed by numeric columns and dates.
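
For context, the load itself is done in chunks along these lines (the file name, chunk size and options here are illustrative, not my exact code):

import pandas as pd

store = pd.HDFStore('myfile.h5')
for chunk in pd.read_csv('myfile.csv', chunksize=500000):
    store.append('df', chunk, data_columns=True)   # append each chunk to a single table
store.close()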

I have never really used numpy, pandas or pytables before but have played around with data frames in R.

I am currently just storing a sample file of around 20,000 rows in HDFStore. When I read the table from HDFStore, it is loaded into memory and memory usage goes up by ~100MB:

from pandas import HDFStore
f = HDFStore('myfile.h5')   # open the store
g = f['df']                 # read the whole table into memory

Then I delete the variable containing the DataFrame:

del g

At this point the memory usage decreases by only about 5MB.

If I load the data into g again using g = f['df'], the memory usage shoots up by another 100MB.

Cleanup only happens when I actually close the window.

The way the data is organized, I am probably going to divide it into individual tables, each around 1GB at most so that it fits into memory, and then work on them one at a time, as sketched below. However, this approach will not work if I am not able to clear memory.
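
Roughly the usage pattern I have in mind (process() is just a placeholder for whatever I end up doing with each partition):

from pandas import HDFStore

store = HDFStore('myfile.h5')
for key in store.keys():        # one table of at most ~1GB per key
    part = store[key]           # load a single partition into memory
    process(part)               # placeholder for the actual work
    del part                    # this is where I need the memory back
store.close()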

Any ideas on how I can achieve this?



1 Answer

To answer the second part of the OP's question ("how to free memory"):

Short answer

Closing the store and deleting the selected dataframe does not free the memory by itself; however, I found that a call to gc.collect() clears up the memory nicely after you delete the dataframe.

Example

In the example below, memory is cleaned automatically as expected:

import gc
import numpy
import pandas

data = numpy.random.rand(10000, 1000)      # memory up by 78 MB
df = pandas.DataFrame(data)                # memory up by 1 MB

store = pandas.HDFStore('test.h5')         # memory up by 3 MB
store.append('df', df)                     # memory up by 9 MB (why?!?!)

del data                                   # no change in memory
del df                                     # memory down by 78 MB

store.close()                              # no change in memory
gc.collect()                               # no change in memory (1)

(1) the store is still in memory, albeit closed

Now suppose we continue from above and reopen the store as below. Memory is cleaned up only after gc.collect() is called:

store = pandas.HDFStore('test.h5')         # no change in memory (2) 
df = store.select('df')                    # memory up by 158MB ?! (3)
del df                                     # no change in memory
store.close()                              # no change in memory
gc.collect()                               # memory down by 158 MB (4)

(2) the store was never released from memory, (3) I have read that selecting a table might take up as much as 3x the size of the table, (4) the store is still there

Finally, I also tried taking a .copy() of the df after selecting it (df = store.select('df')). Do not do this; it creates a monster in memory that cannot be garbage-collected afterwards.
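
Putting it together, a pattern along these lines should keep memory bounded when working through tables one at a time (process() is just a placeholder):

import gc
import pandas

store = pandas.HDFStore('test.h5')
for key in store.keys():
    df = store.select(key)      # load one table into memory
    process(df)                 # placeholder for the actual work
    del df                      # drop the reference...
    gc.collect()                # ...and actually release the memory
store.close()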

Final question: if a DataFrame in memory is 100MB, I understand it might occupy 2-3x its size in memory while being loaded, but why does it stay at 200MB in memory after I select it from an HDFStore and close the store?
