I noticed that memory allocated while iterating through a Pandas GroupBy object is not deallocated after iteration. I use resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
(see the second answer in this post for details) to measure the peak resident memory used by the Python process.
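Note that ru_maxrss is a high-water mark: it only ever increases, so a leak shows up as a growing peak rather than as memory that fails to shrink. On Linux the value is reported in kilobytes (hence the division by 1e6 to get GB below); on macOS it is in bytes. A minimal helper sketch along those lines (the peak_rss_gb name is mine, not from the original post):

import resource
import sys

def peak_rss_gb():
    """Peak resident set size of this process, in GB (never decreases)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    return rss / 1e6 if sys.platform != "darwin" else rss / 1e9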
import resource
import gc

import pandas as pd
import numpy as np

i = np.random.choice(list(range(100)), 4000)
cols = list(range(int(2e4)))
# ~0.64 GB of int64 data (4000 rows x 20000 columns x 8 bytes)
df = pd.DataFrame(1, index=i, columns=cols)

gb = df.groupby(level=0)
# gb = list(gb)
for i in range(3):
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    for idx, x in enumerate(gb):
        if idx == 0:
            print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    # del idx, x
    # gc.collect()
This prints the following peak memory readings in GB, two per outer pass; the peak grows by roughly the size of df on every pass:
0.671732
1.297424
1.297952
1.923288
1.923288
2.548624
Uncommenting del idx, x and gc.collect() fixes the problem (the loop with these lines active is sketched after the numbers below). I do, however, have to del every variable that references the DataFrames yielded by iterating over the groupby, which can be a pain depending on the code in the inner loop. The new printed memory usages become:
0.671768
1.297412
1.297992
1.297992
1.297992
1.297992
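For clarity, this is the same loop as above with both cleanup lines uncommented (same setup and imports):

for i in range(3):
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    for idx, x in enumerate(gb):
        if idx == 0:
            print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    # Drop the last references to the yielded group, then force a full
    # collection so cyclic garbage from the iteration is reclaimed.
    del idx, x
    gc.collect()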
Alternatively, I can uncomment gb = list(gb), which materializes all the groups once up front (sketched after the numbers below). The resulting memory usages are roughly the same as those from the previous solution:
1.32874
1.32874
1.32874
1.32874
1.32874
1.32874
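That is, the (key, DataFrame) pairs are built exactly once and later passes reuse the same objects instead of having the GroupBy re-create them. A trimmed sketch of that variant, with the loop body elided:

gb = list(df.groupby(level=0))  # build the (key, DataFrame) pairs exactly once
for i in range(3):
    for idx, x in enumerate(gb):
        pass  # x is a (key, DataFrame) tuple; later passes reuse the same objects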
This is very interesting! You do not need del idx, x. Using only gc.collect() kept memory constant for me, which is much cleaner than having the del statements inside the loop.
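In other words, a trimmed sketch of the gc.collect()-only variant (same setup as above):

for i in range(3):
    for idx, x in enumerate(gb):
        pass          # work with each group here
    gc.collect()      # collecting cyclic garbage alone keeps the peak flat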