My code includes this while loop:
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]
A is a pandas DataFrame that looks like this:
   score  coordinate
0  0.158           1
1  0.167           2
2  0.175           3
3  0.183           4
4  0.190           5
I am trying to find the maximum score (a peak) in A, then exclude some coordinates (a couple hundred in this case) around the previously found peak, then find the next peak, and so on. A here is a very large pandas DataFrame. Before running this while loop, the ipython session used 20% of machine memory. I thought that running this while loop would only cause memory consumption to drop, because I am excluding some data from the DataFrame. However, what I observe is that memory usage keeps increasing, and at some point machine memory is exhausted.

Is there anything I missed here? Do I need to explicitly free memory somewhere?
Here is a short script that can replicate the behavior using random data:
import numpy as np
import pandas as pd
A = pd.DataFrame({'score':np.random.random(132346018), 'coordinate':np.arange(1, 132346019)})
peak_centers = []
peak_scores = []
exclusion = 147
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

# I terminated the loop once memory consumption reached 90% of machine memory,
# but peak_centers and peak_scores were still short lists
print len(peak_centers)
# output is 16
Your DataFrame is simply too big to deal with this way. The memory load doubles while you're executing this line:
A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]
That's because you're assigning a new value to A, so memory is allocated for a new DataFrame while the old one is being filtered. The new one is almost the same size as the old one, because you're selecting almost all of the data points. That eats up enough memory for two copies of A, and that's without taking into account the extra memory for the bookkeeping that the loc implementation does.

Apparently loc causes pandas to allocate enough memory for an additional copy of the data. I'm not sure why this is; I presume it's a performance optimization of some sort. This means that, at peak memory usage, you end up consuming four times the size of the DataFrame. Once loc is done and the no-longer-referenced memory is freed (which you can force by calling gc.collect()), the memory load drops back down to double the size of the DataFrame. On the next call to loc everything doubles again and you're back up to a quadruple load. Garbage-collect again and you're back down to double. This will continue for as long as you like.
To verify what's going on, run this modified version of your code:
import numpy as np
import pandas as pd
import gc
A = pd.DataFrame({'score':np.random.random(32346018), 'coordinate':np.arange(1, 32346019)})
peak_centers = []
peak_scores = []
exclusion = 147
count = 0
while A.shape[0] > 0:
    gc.collect()  # Force garbage collection.
    count += 1  # Increment the iteration count.
    print('iteration %d, shape %s' % (count, A.shape))
    raw_input()  # Wait for the user to press Enter.
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    print(len(peak_centers), len(peak_scores))
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]
Press the Enter key between iterations and keep an eye on memory usage with top or a similar tool.

At the start of the first iteration you'll see a memory usage of x percent. On the second iteration, after loc has been called for the first time, memory usage doubles to 2x. Subsequently, you'll see it go up to 4x during each call to loc, then drop back to 2x after garbage collection.
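If you would rather log a number than watch top, the standard-library resource module can report the process's peak memory use. Here is a minimal sketch, assuming a Unix machine and Linux's convention that ru_maxrss is reported in kilobytes:

import resource

def peak_rss_gb():
    # Peak resident set size of this process so far, in gigabytes.
    # On Linux ru_maxrss is in kilobytes; on macOS it is in bytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024.0 ** 2)

print('peak RSS so far: %.3f GB' % peak_rss_gb())

Printing peak_rss_gb() alongside the shape inside the loop gives a numeric record of the same pattern. Note that ru_maxrss is a high-water mark, so it confirms the quadruple peak but does not drop back down after gc.collect() the way top's resident-size column does.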
Use DataFrame.drop with inplace=True if you want to destructively mutate A without copying a large subset of A's data:
places_to_drop = (A.coordinate - one_center).between(-exclusion, exclusion)
A.drop(A.index[np.where(places_to_drop)], inplace=True)
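Put back into the shape of your loop, that looks roughly like the sketch below. It uses a smaller, made-up frame so it finishes quickly; with your real data, A is built exactly as in your question.

import numpy as np
import pandas as pd

# Made-up, smaller frame purely for illustration.
A = pd.DataFrame({'score': np.random.random(100000), 'coordinate': np.arange(1, 100001)})
peak_centers = []
peak_scores = []
exclusion = 147

while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # Drop the rows around the selected peak in place instead of
    # binding the name A to a filtered copy of the whole frame.
    places_to_drop = (A.coordinate - one_center).between(-exclusion, exclusion)
    A.drop(A.index[np.where(places_to_drop)], inplace=True)

print(len(peak_centers))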
The place where the original usage of loc ultimately bottoms out is the _NDFrameIndexer method _getitem_iterable. _LocIndexer is a child class of _NDFrameIndexer, and an instance of _LocIndexer gets created and populates the loc property of a DataFrame.
In particular, _getitem_iterable performs a check for a Boolean index, which occurs in your case. Then a new array of the Boolean locations is created (which is wasteful of memory when key is already in Boolean format):
inds, = key.nonzero()
and then finally the "true" locations are returned within a copy:
return self.obj.take(inds, axis=axis, convert=False)
From the code: key will be your Boolean index (i.e. the result of the expression (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)), and self.obj will be the parent DataFrame instance from which loc was invoked, so obj here is just A.
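Since key is a Boolean filter covering every row of A, and almost every entry is True (only a few hundred coordinates are excluded per iteration), the nonzero() step materializes an integer position for nearly every row. A small standalone sketch with plain NumPy and a made-up size shows the cost on a 64-bit platform:

import numpy as np

# Stand-in for the almost-all-True boolean filter over A's rows.
key = np.ones(10000000, dtype=bool)

inds, = key.nonzero()  # the intermediate array the indexer builds

print(key.nbytes)   # 10000000 bytes: 1 byte per boolean
print(inds.nbytes)  # 80000000 bytes: 8 bytes per int64 position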
The DataFrame.take documentation explains that the default behavior is to make a copy. In the current implementation of the indexers, there is no mechanism for passing a keyword argument through that would eventually be used to perform take without making a copy.
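You can confirm the copying behavior on toy data: after a Boolean loc lookup, the result's columns live in freshly allocated buffers rather than views onto the original frame. A small sketch (assuming NumPy is recent enough to provide np.shares_memory):

import numpy as np
import pandas as pd

small = pd.DataFrame({'score': np.random.random(10), 'coordinate': np.arange(1, 11)})
mask = (small.coordinate <= 3) | (small.coordinate >= 8)
kept = small.loc[mask]

# Neither column of the result shares its buffer with the original frame.
print(np.shares_memory(small.score.values, kept.score.values))            # False
print(np.shares_memory(small.coordinate.values, kept.coordinate.values))  # False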
On any reasonably modern machine, using the drop approach should be problem-free for the size of data you describe, so it is not the case that A's size itself is to blame.