I see what I think is a memory leak when running my Python script. Here is my script:
import sys
import time


class MyObj(object):
    def __init__(self, filename):
        with open(filename) as f:
            self.att = f.read()


def myfunc(filename):
    mylist = [MyObj(filename) for x in xrange(100)]
    len(mylist)
    return []


def main():
    filename = sys.argv[1]
    myfunc(filename)
    time.sleep(3600)


if __name__ == '__main__':
    main()
The main function calls myfunc(), which creates a list of 100 objects that each open and read a file. After returning from myfunc(), I'd expect the memory from the 100-item list and from reading the file to be freed, since they are no longer referenced. However, when I check the memory usage using the ps command, the Python process uses about 10,000 KB more memory than a Python process run from a script with lines 12 and 13 commented out.
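For anyone reproducing the measurement without ps: on Linux the resident set size can also be read from /proc. This is only a minimal sketch; the helper names parse_vmrss_kb and current_rss_kb are made up for illustration, and the value should be close to, but is not guaranteed to match exactly, the RSS column that ps prints.

```python
import os


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in KB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:      5552 kB".
            return int(line.split()[1])
    return None


def current_rss_kb(pid="self"):
    """Read the resident set size of a process on Linux; returns None elsewhere."""
    path = "/proc/%s/status" % pid
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_vmrss_kb(f.read())
```

Calling current_rss_kb() before and after myfunc() would give the same kind of before/after comparison from inside the process.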
The strange thing is that the memory leak (if that's what it is) only seems to occur for files <128KB in size. I created a bash script to run this script with files ranging in size from 1KB to 200KB, and the memory increase stopped when the file size hit 128KB. Here is the bash script:
#!/bin/bash
echo "PID RSS S TTY TIME COMMAND" > output.txt
for i in `seq 1 200`;
do
    python debug_memory.py "data/stuff_${i}K.txt" &
    pid=$!
    sleep 0.1
    ps -e -O rss | grep $pid | grep -v grep >> output.txt
    kill $pid
done
Here is the output of the bash script:
PID RSS S TTY TIME COMMAND
28471 5552 S pts/16 00:00:00 python debug_memory.py data/stuff_1K.txt
28477 5656 S pts/16 00:00:00 python debug_memory.py data/stuff_2K.txt
28483 5756 S pts/16 00:00:00 python debug_memory.py data/stuff_3K.txt
28488 5852 S pts/16 00:00:00 python debug_memory.py data/stuff_4K.txt
28494 5952 S pts/16 00:00:00 python debug_memory.py data/stuff_5K.txt
28499 6052 S pts/16 00:00:00 python debug_memory.py data/stuff_6K.txt
28505 6156 S pts/16 00:00:00 python debug_memory.py data/stuff_7K.txt
28511 6256 S pts/16 00:00:00 python debug_memory.py data/stuff_8K.txt
28516 6356 S pts/16 00:00:00 python debug_memory.py data/stuff_9K.txt
28522 6452 S pts/16 00:00:00 python debug_memory.py data/stuff_10K.txt
28527 6552 S pts/16 00:00:00 python debug_memory.py data/stuff_11K.txt
28533 6656 S pts/16 00:00:00 python debug_memory.py data/stuff_12K.txt
28539 6756 S pts/16 00:00:00 python debug_memory.py data/stuff_13K.txt
28544 6852 S pts/16 00:00:00 python debug_memory.py data/stuff_14K.txt
28550 6952 S pts/16 00:00:00 python debug_memory.py data/stuff_15K.txt
28555 7056 S pts/16 00:00:00 python debug_memory.py data/stuff_16K.txt
28561 7156 S pts/16 00:00:00 python debug_memory.py data/stuff_17K.txt
28567 7252 S pts/16 00:00:00 python debug_memory.py data/stuff_18K.txt
28572 7356 S pts/16 00:00:00 python debug_memory.py data/stuff_19K.txt
28578 7452 S pts/16 00:00:00 python debug_memory.py data/stuff_20K.txt
28584 7556 S pts/16 00:00:00 python debug_memory.py data/stuff_21K.txt
28589 7652 S pts/16 00:00:00 python debug_memory.py data/stuff_22K.txt
28595 7756 S pts/16 00:00:00 python debug_memory.py data/stuff_23K.txt
28600 7852 S pts/16 00:00:00 python debug_memory.py data/stuff_24K.txt
28606 7952 S pts/16 00:00:00 python debug_memory.py data/stuff_25K.txt
28612 8052 S pts/16 00:00:00 python debug_memory.py data/stuff_26K.txt
28617 8152 S pts/16 00:00:00 python debug_memory.py data/stuff_27K.txt
28623 8252 S pts/16 00:00:00 python debug_memory.py data/stuff_28K.txt
28629 8356 S pts/16 00:00:00 python debug_memory.py data/stuff_29K.txt
28634 8452 S pts/16 00:00:00 python debug_memory.py data/stuff_30K.txt
28640 8556 S pts/16 00:00:00 python debug_memory.py data/stuff_31K.txt
28645 8656 S pts/16 00:00:00 python debug_memory.py data/stuff_32K.txt
28651 8756 S pts/16 00:00:00 python debug_memory.py data/stuff_33K.txt
28657 8856 S pts/16 00:00:00 python debug_memory.py data/stuff_34K.txt
28662 8956 S pts/16 00:00:00 python debug_memory.py data/stuff_35K.txt
28668 9056 S pts/16 00:00:00 python debug_memory.py data/stuff_36K.txt
28674 9156 S pts/16 00:00:00 python debug_memory.py data/stuff_37K.txt
28679 9256 S pts/16 00:00:00 python debug_memory.py data/stuff_38K.txt
28685 9352 S pts/16 00:00:00 python debug_memory.py data/stuff_39K.txt
28691 9452 S pts/16 00:00:00 python debug_memory.py data/stuff_40K.txt
28696 9552 S pts/16 00:00:00 python debug_memory.py data/stuff_41K.txt
28702 9656 S pts/16 00:00:00 python debug_memory.py data/stuff_42K.txt
28707 9756 S pts/16 00:00:00 python debug_memory.py data/stuff_43K.txt
28713 9852 S pts/16 00:00:00 python debug_memory.py data/stuff_44K.txt
28719 9952 S pts/16 00:00:00 python debug_memory.py data/stuff_45K.txt
28724 10052 S pts/16 00:00:00 python debug_memory.py data/stuff_46K.txt
28730 10156 S pts/16 00:00:00 python debug_memory.py data/stuff_47K.txt
28739 10256 S pts/16 00:00:00 python debug_memory.py data/stuff_48K.txt
28746 10352 S pts/16 00:00:00 python debug_memory.py data/stuff_49K.txt
28752 10452 S pts/16 00:00:00 python debug_memory.py data/stuff_50K.txt
28757 10556 S pts/16 00:00:00 python debug_memory.py data/stuff_51K.txt
28763 10656 S pts/16 00:00:00 python debug_memory.py data/stuff_52K.txt
28769 10752 S pts/16 00:00:00 python debug_memory.py data/stuff_53K.txt
28774 10852 S pts/16 00:00:00 python debug_memory.py data/stuff_54K.txt
28780 10952 S pts/16 00:00:00 python debug_memory.py data/stuff_55K.txt
28786 11052 S pts/16 00:00:00 python debug_memory.py data/stuff_56K.txt
28791 11152 S pts/16 00:00:00 python debug_memory.py data/stuff_57K.txt
28797 11256 S pts/16 00:00:00 python debug_memory.py data/stuff_58K.txt
28802 11356 S pts/16 00:00:00 python debug_memory.py data/stuff_59K.txt
28808 11452 S pts/16 00:00:00 python debug_memory.py data/stuff_60K.txt
28814 11556 S pts/16 00:00:00 python debug_memory.py data/stuff_61K.txt
28819 11656 S pts/16 00:00:00 python debug_memory.py data/stuff_62K.txt
28825 11752 S pts/16 00:00:00 python debug_memory.py data/stuff_63K.txt
28831 11852 S pts/16 00:00:00 python debug_memory.py data/stuff_64K.txt
28836 11956 S pts/16 00:00:00 python debug_memory.py data/stuff_65K.txt
28842 12052 S pts/16 00:00:00 python debug_memory.py data/stuff_66K.txt
28847 12152 S pts/16 00:00:00 python debug_memory.py data/stuff_67K.txt
28853 12256 S pts/16 00:00:00 python debug_memory.py data/stuff_68K.txt
28859 12356 S pts/16 00:00:00 python debug_memory.py data/stuff_69K.txt
28864 12452 S pts/16 00:00:00 python debug_memory.py data/stuff_70K.txt
28871 12556 S pts/16 00:00:00 python debug_memory.py data/stuff_71K.txt
28877 12652 S pts/16 00:00:00 python debug_memory.py data/stuff_72K.txt
28883 12756 S pts/16 00:00:00 python debug_memory.py data/stuff_73K.txt
28889 12856 S pts/16 00:00:00 python debug_memory.py data/stuff_74K.txt
28894 12952 S pts/16 00:00:00 python debug_memory.py data/stuff_75K.txt
28900 13056 S pts/16 00:00:00 python debug_memory.py data/stuff_76K.txt
28906 13156 S pts/16 00:00:00 python debug_memory.py data/stuff_77K.txt
28911 13256 S pts/16 00:00:00 python debug_memory.py data/stuff_78K.txt
28917 13352 S pts/16 00:00:00 python debug_memory.py data/stuff_79K.txt
28922 13452 S pts/16 00:00:00 python debug_memory.py data/stuff_80K.txt
28928 13556 S pts/16 00:00:00 python debug_memory.py data/stuff_81K.txt
28934 13652 S pts/16 00:00:00 python debug_memory.py data/stuff_82K.txt
28939 13752 S pts/16 00:00:00 python debug_memory.py data/stuff_83K.txt
28945 13852 S pts/16 00:00:00 python debug_memory.py data/stuff_84K.txt
28951 13952 S pts/16 00:00:00 python debug_memory.py data/stuff_85K.txt
28956 14052 S pts/16 00:00:00 python debug_memory.py data/stuff_86K.txt
28962 14152 S pts/16 00:00:00 python debug_memory.py data/stuff_87K.txt
28967 14256 S pts/16 00:00:00 python debug_memory.py data/stuff_88K.txt
28973 14352 S pts/16 00:00:00 python debug_memory.py data/stuff_89K.txt
28979 14456 S pts/16 00:00:00 python debug_memory.py data/stuff_90K.txt
28984 14552 S pts/16 00:00:00 python debug_memory.py data/stuff_91K.txt
28990 14652 S pts/16 00:00:00 python debug_memory.py data/stuff_92K.txt
28996 14756 S pts/16 00:00:00 python debug_memory.py data/stuff_93K.txt
29001 14852 S pts/16 00:00:00 python debug_memory.py data/stuff_94K.txt
29007 14956 S pts/16 00:00:00 python debug_memory.py data/stuff_95K.txt
29012 15052 S pts/16 00:00:00 python debug_memory.py data/stuff_96K.txt
29018 15156 S pts/16 00:00:00 python debug_memory.py data/stuff_97K.txt
29024 15252 S pts/16 00:00:00 python debug_memory.py data/stuff_98K.txt
29029 15360 S pts/16 00:00:00 python debug_memory.py data/stuff_99K.txt
29035 15456 S pts/16 00:00:00 python debug_memory.py data/stuff_100K.txt
29040 15556 S pts/16 00:00:00 python debug_memory.py data/stuff_101K.txt
29046 15652 S pts/16 00:00:00 python debug_memory.py data/stuff_102K.txt
29052 15756 S pts/16 00:00:00 python debug_memory.py data/stuff_103K.txt
29057 15852 S pts/16 00:00:00 python debug_memory.py data/stuff_104K.txt
29063 15952 S pts/16 00:00:00 python debug_memory.py data/stuff_105K.txt
29069 16056 S pts/16 00:00:00 python debug_memory.py data/stuff_106K.txt
29074 16152 S pts/16 00:00:00 python debug_memory.py data/stuff_107K.txt
29080 16256 S pts/16 00:00:00 python debug_memory.py data/stuff_108K.txt
29085 16356 S pts/16 00:00:00 python debug_memory.py data/stuff_109K.txt
29091 16452 S pts/16 00:00:00 python debug_memory.py data/stuff_110K.txt
29097 16552 S pts/16 00:00:00 python debug_memory.py data/stuff_111K.txt
29102 16652 S pts/16 00:00:00 python debug_memory.py data/stuff_112K.txt
29108 16756 S pts/16 00:00:00 python debug_memory.py data/stuff_113K.txt
29113 16852 S pts/16 00:00:00 python debug_memory.py data/stuff_114K.txt
29119 16952 S pts/16 00:00:00 python debug_memory.py data/stuff_115K.txt
29125 17056 S pts/16 00:00:00 python debug_memory.py data/stuff_116K.txt
29130 17156 S pts/16 00:00:00 python debug_memory.py data/stuff_117K.txt
29136 17256 S pts/16 00:00:00 python debug_memory.py data/stuff_118K.txt
29141 17356 S pts/16 00:00:00 python debug_memory.py data/stuff_119K.txt
29147 17452 S pts/16 00:00:00 python debug_memory.py data/stuff_120K.txt
29153 17556 S pts/16 00:00:00 python debug_memory.py data/stuff_121K.txt
29158 17656 S pts/16 00:00:00 python debug_memory.py data/stuff_122K.txt
29164 17756 S pts/16 00:00:00 python debug_memory.py data/stuff_123K.txt
29170 17856 S pts/16 00:00:00 python debug_memory.py data/stuff_124K.txt
29175 17952 S pts/16 00:00:00 python debug_memory.py data/stuff_125K.txt
29181 18056 S pts/16 00:00:00 python debug_memory.py data/stuff_126K.txt
29186 18152 S pts/16 00:00:00 python debug_memory.py data/stuff_127K.txt
29192 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_128K.txt
29198 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_129K.txt
29203 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_130K.txt
29209 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_131K.txt
29215 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_132K.txt
29220 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_133K.txt
29226 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_134K.txt
29231 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_135K.txt
29237 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_136K.txt
29243 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_137K.txt
29248 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_138K.txt
29254 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_139K.txt
29260 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_140K.txt
29265 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_141K.txt
29271 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_142K.txt
29276 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_143K.txt
29282 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_144K.txt
29288 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_145K.txt
29293 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_146K.txt
29299 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_147K.txt
29305 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_148K.txt
29310 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_149K.txt
29316 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_150K.txt
29321 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_151K.txt
29327 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_152K.txt
29333 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_153K.txt
29338 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_154K.txt
29344 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_155K.txt
29349 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_156K.txt
29355 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_157K.txt
29361 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_158K.txt
29366 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_159K.txt
29372 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_160K.txt
29378 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_161K.txt
29383 5460 S pts/16 00:00:00 python debug_memory.py data/stuff_162K.txt
29389 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_163K.txt
29394 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_164K.txt
29400 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_165K.txt
29406 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_166K.txt
29411 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_167K.txt
29417 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_168K.txt
29423 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_169K.txt
29428 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_170K.txt
29434 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_171K.txt
29439 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_172K.txt
29445 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_173K.txt
29451 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_174K.txt
29456 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_175K.txt
29463 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_176K.txt
29483 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_177K.txt
29489 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_178K.txt
29496 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_179K.txt
29501 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_180K.txt
29507 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_181K.txt
29512 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_182K.txt
29518 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_183K.txt
29524 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_184K.txt
29529 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_185K.txt
29535 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_186K.txt
29541 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_187K.txt
29546 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_188K.txt
29552 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_189K.txt
29557 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_190K.txt
29563 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_191K.txt
29569 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_192K.txt
29574 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_193K.txt
29580 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_194K.txt
29586 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_195K.txt
29591 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_196K.txt
29597 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_197K.txt
29602 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_198K.txt
29608 5456 S pts/16 00:00:00 python debug_memory.py data/stuff_199K.txt
29614 5452 S pts/16 00:00:00 python debug_memory.py data/stuff_200K.txt
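A quick sanity check on the numbers above, as a sketch: if all 100 file-sized strings stay resident, RSS should be roughly the baseline plus 100 times the file size. The baseline of 5452 KB is taken from the rows at 128K and above; the function name expected_rss_kb is just for illustration.

```python
BASELINE_RSS_KB = 5452   # RSS seen for the runs at 128K and above
NUM_OBJECTS = 100        # length of mylist in the script


def expected_rss_kb(file_size_kb):
    """RSS predicted if all 100 file-sized strings remain resident."""
    return BASELINE_RSS_KB + NUM_OBJECTS * file_size_kb


# Compare the prediction with three rows copied from the output above.
for size_kb, observed in [(1, 5552), (64, 11852), (127, 18152)]:
    print("%dK: predicted %d KB, observed %d KB"
          % (size_kb, expected_rss_kb(size_kb), observed))
```

The prediction matches these three rows, which suggests that below 128K essentially none of the string memory is handed back to the OS.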
Can someone explain what is happening? Why do I see an increase in memory usage when using files <128KB?
My full test environment is located here: https://github.com/saltycrane/debugging-python-memory-usage/tree/50f73358c7a84a504333ce9c4071b0f3537bbc0f
I am running Python 2.7.3 on Ubuntu 12.04.
Update 1: This issue is not specific to working with files <128K in size. I get the same results when setting the object attribute to a value of the same size as was read from the file. Here is the updated code:
import sys
import time


class MyObj(object):
    def __init__(self, size_kb):
        self.att = ' ' * int(size_kb) * 1024


def myfunc(size_kb):
    mylist = [MyObj(size_kb) for x in xrange(100)]
    len(mylist)
    return []


def main():
    size_kb = sys.argv[1]
    myfunc(size_kb)
    time.sleep(3600)


if __name__ == '__main__':
    main()
Running this script gives similar results. The updated test environment is located here: https://github.com/saltycrane/debugging-python-memory-usage/tree/59b7ff61134dfc11c4195e9201b2c1728ed4fcce
Update 2: I simplified my test script further by: 1. removing the class and simply creating a list of strings; 2. removing myfunc() and using del to delete the mylist object.
import sys
import time


def main():
    size_kb = sys.argv[1]
    mylist = []
    for x in xrange(100):
        mystr = ' ' * int(size_kb) * 1024
        mylist.append(mystr)
    del mylist
    time.sleep(3600)


if __name__ == '__main__':
    main()
My simplified script also gives similar results to the original. However, if I don't create a separate string variable, I don't see an increase in memory. Here is the script that does not create an increase in memory:
import sys
import time


def main():
    size_kb = sys.argv[1]
    mylist = []
    for x in xrange(100):
        mylist.append(' ' * int(size_kb) * 1024)
    del mylist
    time.sleep(3600)


if __name__ == '__main__':
    main()
The updated test environment is located here: https://github.com/saltycrane/debugging-python-memory-usage/tree/423ca6a50dccbe32572a9d0dea1068ddcb06663b
More questions:
ps expected?
I discovered some interesting information about "free lists" that seems like it could be related to this issue:
From the last link:
To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x.
If small objects memory is never freed, then the inescapable conclusion is that, like goldfishes, these small object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.
Update 3: I oversimplified the code in Update 2. Adding the line del mystr at the end of the script freed the memory.
(See: https://github.com/saltycrane/debugging-python-memory-usage/blob/dd058e4774802cae7cbfca520fb835ea46b645e8/debug_memory_leaks.py)
I updated the script to be sufficiently complicated to demonstrate the issue. The issue still exists in the following code. The latest code/environment is located here: https://github.com/saltycrane/debugging-python-memory-usage/tree/fc0c8ce9ba621cb86b6abb93adf1b297a7c0230b
import gc
import sys
import time


def main():
    size_kb = sys.argv[1]
    mylist = []
    for x in xrange(100):
        mystr = ' ' * int(size_kb) * 1024
        mydict = {'mykey': mystr}
        mylist.append(mydict)
    del mystr
    del mydict
    del mylist
    gc.collect()
    time.sleep(3600)


if __name__ == '__main__':
    main()
I also ran the script in some other environments. The strange result came from running within a clean virtualenv. In this case, the memory dropoff occurred at 260KB instead of 128KB. See https://github.com/saltycrane/debugging-python-memory-usage/tree/52fbd5d57ff45affdcd70623ddb74fa1f1ffbbc2
Environments:
More references:
http://hg.python.org/releasing/2.7.3/file/7bb96963d067/Objects/obmalloc.c
After reading some of these, I see a reference to an "arena size" of 256KB. Maybe that is related?
schlenk uncovered the reason the memory usage drops off at 128KB. 128KB is the point at which "memory allocation functions" (malloc?) use mmap instead of increasing the program break using sbrk. Interestingly, the threshold can be changed via an environment variable. I ran a test setting the MALLOC_MMAP_THRESHOLD_ environment variable to different values and the dropoff in memory usage matched that value.
See here for results:
https://github.com/saltycrane/debugging-python-memory-usage/blob/97d93cd165a139a6b6f96720de63a92561dd2f05/output_debug_memory_leaks.py.txt
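For reference, this kind of test can be sketched in Python roughly as follows. The helper names env_with_mmap_threshold and launch are hypothetical and not taken from the linked repository; the only thing assumed from glibc is the MALLOC_MMAP_THRESHOLD_ variable itself (note the trailing underscore).

```python
import os
import subprocess
import sys


def env_with_mmap_threshold(threshold_bytes, base_env=None):
    """Return a copy of the environment with glibc's mmap threshold set."""
    env = dict(os.environ if base_env is None else base_env)
    # glibc reads this variable at startup; the value is in bytes.
    env["MALLOC_MMAP_THRESHOLD_"] = str(int(threshold_bytes))
    return env


def launch(script, size_kb, threshold_bytes):
    """Start a test script in a child process under the given threshold."""
    return subprocess.Popen(
        [sys.executable, script, str(size_kb)],
        env=env_with_mmap_threshold(threshold_bytes),
    )
```

Something like launch("debug_memory_leaks.py", 100, 64 * 1024) would then start the child with a 64KB threshold, after which its RSS can be sampled with ps as in the bash script.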
I would still like to know if it is expected behavior for my script to leak memory for string values < 128KB.
A few more links:
Note: According to the last two links, there is a performance (speed) hit for using mmap instead of sbrk.
You might simply hit the default behaviour of the linux memory allocator.
Basically Linux has two allocation strategies, sbrk() for small blocks of memory and mmap() for larger blocks. sbrk() allocated memory blocks cannot easily be returned to the system, while mmap() based ones can (just unmap the page).
So if you allocate a memory block larger than the value where the malloc() allocator in your libc decides to switch between sbrk() and mmap() you see this effect. See the mallopt() call, especially the MMAP_THRESHOLD (http://man7.org/linux/man-pages/man3/mallopt.3.html).
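As a rough illustration of the mallopt() route, the threshold can also be changed from inside the Python process with ctypes. This is a sketch under assumptions: it only does anything on a glibc system, the constant -3 for M_MMAP_THRESHOLD comes from glibc's malloc.h, and the helper name set_mmap_threshold is made up. It affects later allocations only, not memory that is already allocated.

```python
import ctypes
import ctypes.util

M_MMAP_THRESHOLD = -3  # value defined in glibc's <malloc.h>


def set_mmap_threshold(threshold_bytes):
    """Ask glibc to use mmap() for allocations of at least threshold_bytes.

    Returns True or False for mallopt()'s success flag, or None when the
    C library cannot be loaded or does not provide mallopt().
    """
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        mallopt = libc.mallopt
    except (OSError, AttributeError):
        return None
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    # mallopt() returns 1 on success and 0 on error.
    return mallopt(M_MMAP_THRESHOLD, int(threshold_bytes)) == 1
```

Calling set_mmap_threshold(64 * 1024) at the top of the test script should move the dropoff to around 64KB, if the explanation above is right.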
Update: To answer your extra question: yes, it is expected that you leak memory that way, if the memory allocator works like the libc one on Linux. If you used the Windows LowFragmentationHeap instead, it would probably not leak, and similarly on AIX, depending on which malloc is configured. Maybe one of the other allocators (tcmalloc etc.) also fixes such issues. sbrk() is blazingly fast, but has issues with memory fragmentation. CPython cannot do much about it, as it does not have a compacting garbage collector, only simple reference counting.
Python offers a few methods to reduce the buffer allocations, see for example the blog post here: http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews/
I would look into garbage collection. It may be that larger files are triggering garbage collection more frequently, while the small files are being freed but collectively staying below some threshold. Specifically, call gc.collect() and then call gc.get_referrers() on the object to hopefully reveal what is keeping an instance around. See the Python doc here:
http://docs.python.org/2/library/gc.html?highlight=gc#gc.get_referrers
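A minimal sketch of that suggestion follows. The class Payload and the helper referrer_types are made-up names for illustration; a class instance is used instead of a plain string so that the containers holding it are tracked by the collector. The exact list printed can vary between Python versions, so no expected output is shown.

```python
import gc


class Payload(object):
    """Stand-in for the question's MyObj; the name is made up for this sketch."""


def referrer_types(obj):
    """Collect garbage, then name the types of the objects referring to obj."""
    gc.collect()
    return sorted(type(r).__name__ for r in gc.get_referrers(obj))


item = Payload()
holder = {"mykey": item}   # a dict that keeps the instance alive
container = [item]         # a list that does the same
print(referrer_types(item))
```

If an object you expected to be freed still shows referrers here, those referrers are what to track down.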
The issue relates to garbage collection, namespace, and reference counting. The bash script you posted is giving a fairly narrow view of the garbage collector's behaviour. Try a larger range and you will see patterns in how much memory certain ranges will take. For example, change the bash for loop to cover a larger range, something like: seq 0 16 2056.
You noticed the memory usage was reduced if you del mystr because you are removing any references left to it. Similar results would likely happen if you limited the mystr variable to its own function like so:
def loopy():
    mylist = []
    for x in xrange(100):
        mystr = ' ' * int(size_kb) * 1024
        mydict = {x: mystr}
        mylist.append(mydict)
    return mylist
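To see the lingering loop variable directly, here is a small sketch. It relies on CPython's reference counting, and it uses a made-up class Blob instead of a string because str objects cannot be weakly referenced.

```python
import weakref


class Blob(object):
    """Weak-referenceable stand-in for a large string (hypothetical name)."""


def last_item_survives():
    mylist = []
    for x in range(3):
        item = Blob()
        mylist.append(item)
    probe = weakref.ref(mylist[-1])  # watch the last object without owning it
    del mylist
    alive_after_del_list = probe() is not None  # `item` still refers to it
    del item
    alive_after_del_item = probe() is not None  # last reference is gone
    return alive_after_del_list, alive_after_del_item


print(last_item_survives())
```

On CPython this prints (True, False): deleting the list alone leaves the last object alive through the loop variable, and only del item (or leaving the function) releases it.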
Rather than using bash scripts, I think you could get more useful information using a memory profiler. Here are a couple examples using Pympler. This first version is similar to your code from Update 3:
import gc
import sys
import time
from pympler import tracker
tr = tracker.SummaryTracker()
print 'begin:'
tr.print_diff()
size_kb = sys.argv[1]
mylist = []
mydict = {}
print 'empty list & dict:'
tr.print_diff()
for x in xrange(100):
    mystr = ' ' * int(size_kb) * 1024
    mydict = {x: mystr}
    mylist.append(mydict)
print 'after for loop:'
tr.print_diff()
del mystr
del mydict
del mylist
print 'after deleting stuff:'
tr.print_diff()
collected = gc.collect()
print 'after garbage collection (collected: %d):' % collected
tr.print_diff()
time.sleep(2)
print 'took a short nap after all that work:'
tr.print_diff()
mylist = []
print 'create an empty list for some reason:'
tr.print_diff()
And the output:
$ python mem_test.py 256
begin:
types | # objects | total size
======================= | =========== | =============
list | 957 | 97.44 KB
str | 951 | 53.65 KB
int | 118 | 2.77 KB
wrapper_descriptor | 8 | 640 B
weakref | 3 | 264 B
member_descriptor | 2 | 144 B
getset_descriptor | 2 | 144 B
function (store_info) | 1 | 120 B
cell | 2 | 112 B
instancemethod | -1 | -80 B
_sre.SRE_Pattern | -2 | -176 B
tuple | -1 | -216 B
dict | 2 | -1744 B
empty list & dict:
types | # objects | total size
======= | =========== | ============
list | 2 | 168 B
str | 2 | 97 B
int | 1 | 24 B
after for loop:
types | # objects | total size
======= | =========== | ============
str | 1 | 256.04 KB
list | 0 | 848 B
after deleting stuff:
types | # objects | total size
======= | =========== | ===============
list | -1 | -920 B
str | -1 | -262181 B
after garbage collection (collected: 0):
types | # objects | total size
======= | =========== | ============
took a short nap after all that work:
types | # objects | total size
======= | =========== | ============
create an empty list for some reason:
types | # objects | total size
======= | =========== | ============
list | 1 | 72 B
Notice that after the for loop the total size for the str class is 256 KB, essentially the same as the argument I passed to it. After explicitly removing the reference to mystr with del mystr, the memory is freed. By then the garbage has already been picked up, so there's no further reduction after gc.collect().
The next version uses a function to create a different namespace for the string.
import gc
import sys
import time
from pympler import tracker
def loopy():
    mylist = []
    for x in xrange(100):
        mystr = ' ' * int(size_kb) * 1024
        mydict = {x: mystr}
        mylist.append(mydict)
    return mylist
tr = tracker.SummaryTracker()
print 'begin:'
tr.print_diff()
size_kb = sys.argv[1]
mylist = loopy()
print 'after for loop:'
tr.print_diff()
del mylist
print 'after deleting stuff:'
tr.print_diff()
collected = gc.collect()
print 'after garbage collection (collected: %d):' % collected
tr.print_diff()
time.sleep(2)
print 'took a short nap after all that work:'
tr.print_diff()
mylist = []
print 'create an empty list for some reason:'
tr.print_diff()
And finally the output from this version:
$ python mem_test_2.py 256
begin:
types | # objects | total size
======================= | =========== | =============
list | 958 | 97.53 KB
str | 952 | 53.70 KB
int | 118 | 2.77 KB
wrapper_descriptor | 8 | 640 B
weakref | 3 | 264 B
member_descriptor | 2 | 144 B
getset_descriptor | 2 | 144 B
function (store_info) | 1 | 120 B
cell | 2 | 112 B
instancemethod | -1 | -80 B
_sre.SRE_Pattern | -2 | -176 B
tuple | -1 | -216 B
dict | 2 | -1744 B
after for loop:
types | # objects | total size
======= | =========== | ============
list | 2 | 1016 B
str | 2 | 97 B
int | 1 | 24 B
after deleting stuff:
types | # objects | total size
======= | =========== | ============
list | -1 | -920 B
after garbage collection (collected: 0):
types | # objects | total size
======= | =========== | ============
took a short nap after all that work:
types | # objects | total size
======= | =========== | ============
create an empty list for some reason:
types | # objects | total size
======= | =========== | ============
list | 1 | 72 B
Now we don't have to clean up the str, and I think this example shows why using functions is a good idea. Writing code as one big chunk in one namespace really prevents the garbage collector from doing its job. It will not come into your house and start assuming things are trash :) It has to know that things are safe to collect.
That Evan Jones link is very interesting btw.