OK, I am experimenting with pandas to load a roughly 30GB CSV file, with 40+ million rows and 150+ columns, into an HDFStore. The majority of the columns are strings, followed by numerical values and dates.
I have never really used numpy, pandas or pytables before but have played around with data frames in R.
I am currently just storing a sample file of around 20000 rows into the HDFStore. When I read the table back from the HDFStore, it is loaded into memory and memory usage goes up by ~100MB:
from pandas import HDFStore

f = HDFStore('myfile.h5')
g = f['df']
Then I delete the variable containing the DataFrame:
del g
At this point the memory usage decreases by only about 5MB.
If I again load the data using g = f['df'], the memory usage shoots up by another 100MB.
Cleanup only happens when I actually close the window.
The way the data is organized, I plan to divide it into individual tables, each at most about 1GB so it can fit into memory, and then use them one at a time. However, this approach will not work if I am unable to clear memory in between.
Any ideas on how I can achieve this?
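The one-table-at-a-time plan can also be sketched with HDFStore's table format, which lets you read a fixed number of rows at a time so only one chunk is ever in memory; the file name and chunk size below are illustrative:

```python
import numpy as np
import pandas as pd

# Write a sample table; append() stores it in table format,
# which is required for chunked reads.
df = pd.DataFrame(np.random.rand(10_000, 5), columns=list("abcde"))
with pd.HDFStore("chunked.h5", mode="w") as store:
    store.append("df", df)

# Read it back one chunk at a time; only `chunksize` rows
# are materialized in memory per iteration.
total_rows = 0
with pd.HDFStore("chunked.h5", mode="r") as store:
    for chunk in store.select("df", chunksize=2_000):
        total_rows += len(chunk)  # process each piece, then let it go

print(total_rows)  # 10000
```

Each `chunk` is an ordinary DataFrame, so per-chunk processing (filtering, aggregating, writing out) works as usual.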
To answer the second part of the OP's question ("how to free memory"):
Short answer
Closing the store and deleting the selected dataframe does not free the memory by itself; however, I found that a call to gc.collect() clears the memory well after you delete the dataframe.
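The reason del alone is not always enough is that objects caught in reference cycles are reclaimed only by Python's cyclic garbage collector, which gc.collect() runs explicitly. A minimal stdlib illustration of that mechanism (the Node class is just for the demo, not from pandas):

```python
import gc
import weakref

class Node:
    """A throwaway object that will participate in a reference cycle."""
    pass

gc.disable()  # make collection deterministic for this demo

a = Node()
b = Node()
a.partner = b   # a -> b
b.partner = a   # b -> a: a reference cycle

probe = weakref.ref(a)  # lets us observe when `a` is actually freed

del a, b
alive_after_del = probe() is not None  # True: the cycle keeps both alive

gc.collect()                           # cycle detector reclaims both objects
alive_after_gc = probe() is not None   # False: memory actually released

gc.enable()
print(alive_after_del, alive_after_gc)  # True False
```

Reference counting alone cannot break such cycles, which is why deleting the name and closing the store can leave the memory in place until a collection runs.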
Example
In the example below, memory is cleaned automatically as expected:
import gc
import numpy
import pandas

data = numpy.random.rand(10000, 1000)  # memory up by 78MB
df = pandas.DataFrame(data)            # memory up by 1MB
store = pandas.HDFStore('test.h5')     # memory up by 3MB
store.append('df', df)                 # memory up by 9MB (why?!?!)
del data                               # no change in memory
del df                                 # memory down by 78MB
store.close()                          # no change in memory
gc.collect()                           # no change in memory (1)
(1) the store is still in memory, albeit closed
Now suppose we continue from the above and reopen the store as below. Memory is freed only after gc.collect() is called:
store = pandas.HDFStore('test.h5')  # no change in memory (2)
df = store.select('df')             # memory up by 158MB ?! (3)
del df                              # no change in memory
store.close()                       # no change in memory
gc.collect()                        # memory down by 158MB (4)
(2) the store never left memory, (3) I have read that selecting a table might take up as much as 3x the size of the table, (4) the store is still there
Finally, I also tried taking a .copy() of the df on open (df = store.select('df')). Do not do this: it creates a monster in memory that cannot be garbage-collected afterwards.
Final question: if a DF in memory is 100MB, I understand that it might occupy 2-3x its size while loading, but why does it stay at 200MB in memory after I select it from an HDFStore and close the store?