Memory usage accumulating in a while loop

My code includes this while loop:

while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

A is a pandas DataFrame that looks like this:

   score  coordinate
0  0.158           1
1  0.167           2
2  0.175           3
3  0.183           4
4  0.190           5

I am trying to find the maximum score (a peak) in A, then exclude some coordinates (a couple hundred in this case) around the previously found peak, then find the next peak and so on.

A is a very large pandas DataFrame. Before running this while loop, the IPython session used 20% of machine memory. I thought that running the loop would only cause memory consumption to drop, because I am excluding data from the DataFrame. However, what I observe is that memory usage keeps increasing, and at some point the machine's memory is exhausted.

Is there anything I missed here? Do I need to explicitly free memory somewhere?

Here is a short script that can replicate the behavior using random data:

import numpy as np
import pandas as pd

A = pd.DataFrame({'score':np.random.random(132346018), 'coordinate':np.arange(1, 132346019)})
peak_centers = []
peak_scores = []
exclusion = 147
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

# I terminated the loop once memory consumption reached 90% of machine memory,
# but peak_centers and peak_scores were still short lists
print(len(peak_centers))
# output is 16
asked Aug 24 '15 by qkhhly

2 Answers

Your DataFrame is simply too big to deal with. The memory load doubles while you're executing this line:

A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

That's because you're assigning a new value to A, so memory is allocated for a new DataFrame while the old one is being filtered. The new one is almost the same size as the old one because you're selecting almost all of the data points. That eats up enough memory for two copies of A, and that's without accounting for the extra bookkeeping memory that the loc implementation uses.

Apparently loc causes pandas to allocate enough memory for an additional copy of the data. I'm not sure why; I presume it's a performance optimization of some sort. This means that at peak memory usage you consume four times the size of the DataFrame. Once loc is done and the no-longer-referenced memory is freed (which you can force by calling gc.collect()), the memory load drops back down to double the size of the DataFrame. On the next call to loc, everything doubles again and you're back up to a quadruple load. Garbage-collect, and you're back down to double. This cycle continues for as long as the loop runs.

To verify what's going on, run this modified version of your code:

import numpy as np
import pandas as pd
import gc

A = pd.DataFrame({'score':np.random.random(32346018), 'coordinate':np.arange(1, 32346019)})
peak_centers = []
peak_scores = []
exclusion = 147
count = 0
while A.shape[0] > 0:
    gc.collect()  # Force garbage collection.
    count += 1    # Increment the iteration count.
    print('iteration %d, shape %s' % (count, A.shape))
    input()       # Wait for the user to press Enter (use raw_input() on Python 2).
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    print(len(peak_centers), len(peak_scores))
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

Press the Enter key between iterations and keep an eye on memory usage with top or a similar tool.

At the start of the first iteration, you'll see a memory usage of x percent. On the second iteration, after loc was called for the first time, memory usage doubles to 2x. Subsequently, you'll see it go up to 4x during each call to loc, then go down to 2x after garbage collection.
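
If you prefer hard numbers to eyeballing top, you can print the process's resident set size at each pause. This is a minimal sketch of my own, assuming the third-party psutil package is installed:

import os
import psutil

def print_rss(label):
    # Resident set size of the current process, in gigabytes.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print('%s: %.2f GB' % (label, rss / 1e9))

Call print_rss('before loc') and print_rss('after loc') around the filtering line to watch the 2x/4x pattern described above.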

answered by Michael Laszlo

Use DataFrame.drop with inplace=True if you want to destructively mutate A without copying a large subset of A's data.

# Rows inside the exclusion window are the ones to drop. Note that
# between() is inclusive at both endpoints, so the two boundary rows
# are dropped here as well, unlike the original loc condition.
places_to_drop = (A.coordinate - one_center).between(-exclusion, exclusion)
A.drop(A.index[places_to_drop], inplace=True)
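
Putting that together, here is a sketch of the whole loop rewritten to mutate A in place rather than rebuild it with loc on every iteration (my own restatement of the approach, using a smaller array so it is quick to try):

import numpy as np
import pandas as pd

A = pd.DataFrame({'score': np.random.random(1000000),
                  'coordinate': np.arange(1, 1000001)})
peak_centers = []
peak_scores = []
exclusion = 147
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # Drop the rows inside the exclusion window in place instead of
    # assigning a filtered copy back to A.
    places_to_drop = (A.coordinate - one_center).between(-exclusion, exclusion)
    A.drop(A.index[places_to_drop], inplace=True)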

The place where the original loc usage ultimately bottoms out is the _NDFrameIndexer method _getitem_iterable. _LocIndexer is a subclass of _NDFrameIndexer, and an instance of _LocIndexer is what backs the loc property of a DataFrame.

In particular, _getitem_iterable checks for a Boolean index, which is what you have here. It then creates a new array holding the integer positions of the True entries (wasteful of memory when key is already a Boolean mask):

inds, = key.nonzero()

and finally the rows at those locations are returned in a copy:

return self.obj.take(inds, axis=axis, convert=False)

Here, key is your Boolean index (i.e. the result of the expression (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)), and self.obj is the DataFrame on which loc was invoked, so obj is just A.

The DataFrame.take documentation explains that the default behavior is to make a copy. In the current implementation of the indexers, there is no mechanism allowing you to pass a keyword argument through that will eventually be used to perform take without making a copy.
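
You can confirm the copy empirically. This quick check (my own illustration, not from the pandas documentation) shows that the result of take does not share memory with the original frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5)})
sub = df.take([0, 1, 2])  # the same path loc follows after converting the mask
print(np.shares_memory(sub['x'].values, df['x'].values))
# prints False: take allocated new memory for the selected rows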

On any reasonably modern machine, the drop approach should handle the data size you describe without trouble, so it is not the case that A's size is to blame.

answered by ely