
Pandas: garbage-collect drop'ped columns to release memory

I'm handling a large dataset with about 20,000,000 rows and 4 columns. Unfortunately, the available memory on my machine (~16GB) is not sufficient.

Example (Time is seconds since midnight):

           Date   Time   Price     Vol
0      20010102  34222  51.750  227900
1      20010102  34234  51.750    5600
2      20010102  34236  51.875   14400

Then I transform the dataset into a proper time-series object:

                         Date   Time   Price     Vol
2001-01-02 09:30:22  20010102  34222  51.750  227900
2001-01-02 09:30:34  20010102  34234  51.750    5600
2001-01-02 09:30:36  20010102  34236  51.875   14400
2001-01-02 09:31:03  20010102  34263  51.750    2200
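
As an illustration of this transformation, here is a minimal sketch that builds the timestamp index from the two columns without intermediate string formatting. It assumes Date is an integer in YYYYMMDD form and Time is seconds since midnight, as in the example above; the names dates, offsets and indexed are hypothetical and not taken from the original code.

```python
import pandas as pd

# Small stand-in for the real dataset (same layout as the example above).
data = pd.DataFrame({
    "Date": [20010102, 20010102, 20010102],
    "Time": [34222, 34234, 34236],
    "Price": [51.750, 51.750, 51.875],
    "Vol": [227900, 5600, 14400],
})

# Parse the calendar date, then add the seconds since midnight as a timedelta.
dates = pd.to_datetime(data["Date"].astype(str), format="%Y%m%d")
offsets = pd.to_timedelta(data["Time"], unit="s")

# Use the combined timestamps as the index of the frame.
indexed = data.set_index(pd.DatetimeIndex(dates + offsets))
print(indexed)
```

Adding a timedelta to a parsed date avoids creating three extra string Series, which may matter when memory is already tight.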

To release memory I want to drop the redundant Date and Time columns. I do it with the .drop() method but the memory is not released. I also tried to call gc.collect() afterwards but that did not help either.
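
To see how much each column contributes, one option is to compare the frame's reported footprint before and after dropping. The sketch below uses DataFrame.memory_usage on a small synthetic frame (the row count n is a hypothetical stand-in for the real 20,000,000 rows). Note that this reports the size of the frame's own buffers, not the memory the operating system attributes to the Python process, so the two numbers can disagree.

```python
import numpy as np
import pandas as pd

n = 1_000  # hypothetical stand-in for the real row count

data = pd.DataFrame({
    "Date": np.full(n, 20010102, dtype=np.int64),
    "Time": np.arange(n, dtype=np.int64),
    "Price": np.full(n, 51.75, dtype=np.float64),
    "Vol": np.full(n, 5600, dtype=np.int64),
})

# Bytes used per column (index excluded).
per_column = data.memory_usage(index=False, deep=True)
before = int(per_column.sum())

# Footprint of a frame that no longer carries Date and Time.
slim = data.drop(columns=["Date", "Time"])
after = int(slim.memory_usage(index=False, deep=True).sum())

print(per_column)
print(before, after)
```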

This is the code I use for the steps described above. The del statement releases memory, but the drop calls do not.

import numpy as np
import pandas as pd

# Store date and time components
m, s = divmod(data.Time.values, 60)
h, m = divmod(m, 60)
s, m, h = pd.Series(np.char.mod('%02d', s)), pd.Series(np.char.mod('%02d', m)), pd.Series(np.char.mod('%02d', h))

# Set time series index
data = data.set_index(pd.to_datetime(data.Date.reset_index(drop=True).apply(str) + h + m + s, format='%Y%m%d%H%M%S'))

# Remove redundant information
del s, m, h
data.drop('Date', axis=1, inplace=True)
data.drop('Time', axis=1, inplace=True)

How can I release the memory from the pandas data frame?

BayerSe asked Jul 18 '15 13:07


1 Answer

del data['Date']
del data['Time']

This will release the memory.
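
A minimal, self-contained sketch of this approach on a toy frame follows. The gc.collect() call is an optional extra taken from the question, not part of the answer, and whether the process actually hands memory back to the operating system may depend on the pandas version and the allocator.

```python
import gc

import pandas as pd

# Toy frame with the same columns as in the question.
data = pd.DataFrame({
    "Date": [20010102, 20010102, 20010102],
    "Time": [34222, 34234, 34236],
    "Price": [51.750, 51.750, 51.875],
    "Vol": [227900, 5600, 14400],
})

# Delete the redundant columns in place, one at a time.
for column in ("Date", "Time"):
    del data[column]

# Optionally ask the garbage collector to run a full collection.
gc.collect()

remaining = list(data.columns)
print(remaining)
```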

Sergey Vasilev answered Oct 23 '22 22:10