I am trying to load a large JSON object into memory and then perform some operations with the data. However, I am noticing a large increase in RAM usage after the JSON file is read, even after the object is out of scope.
Here is the code:
import json
import objgraph
import gc
from memory_profiler import profile

@profile
def open_stuff():
    with open("bigjson.json", 'r') as jsonfile:
        d = jsonfile.read()
        jsonobj = json.loads(d)
        objgraph.show_most_common_types()
        del jsonobj
        del d
        print('d')
        gc.collect()

open_stuff()
I tried running this script on Windows with Python 2.7.12 and on Debian 9 with Python 2.7.13, and I am seeing an issue with Python on Linux.
In Windows, when I run the script, it uses up a lot of RAM while the json object is being read and in scope (as expected), but it is released after the operation is done (as expected).
list 3039184
dict 413840
function 2200
wrapper_descriptor 1199
builtin_function_or_method 819
method_descriptor 651
tuple 617
weakref 554
getset_descriptor 362
member_descriptor 250
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 16.9 MiB 16.9 MiB @profile
6 def open_stuff():
7 16.9 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 197.9 MiB 181.0 MiB d= jsonfile.read()
9 1393.4 MiB 1195.5 MiB jsonobj = json.loads(d)
10 1397.0 MiB 3.6 MiB objgraph.show_most_common_types()
11 402.8 MiB -994.2 MiB del jsonobj
12 221.8 MiB -181.0 MiB del d
13 221.8 MiB 0.0 MiB print ('d')
14 23.3 MiB -198.5 MiB gc.collect()
However, in the Linux environment, over 500 MB of RAM is still used even though all references to the JSON object have been deleted.
list 3039186
dict 413836
function 2336
wrapper_descriptor 1193
builtin_function_or_method 765
method_descriptor 651
tuple 514
weakref 480
property 273
member_descriptor 250
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 14.2 MiB 14.2 MiB @profile
6 def open_stuff():
7 14.2 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 195.1 MiB 181.0 MiB d= jsonfile.read()
9 1466.4 MiB 1271.3 MiB jsonobj = json.loads(d)
10 1466.8 MiB 0.4 MiB objgraph.show_most_common_types()
11 694.8 MiB -772.1 MiB del jsonobj
12 513.8 MiB -181.0 MiB del d
13 513.8 MiB 0.0 MiB print ('d')
14 513.0 MiB -0.8 MiB gc.collect()
The same script, run on Debian 9 with Python 3.5.3, uses less RAM but leaks a proportionate amount of RAM.
list 3039266
dict 414638
function 3374
tuple 1254
wrapper_descriptor 1076
weakref 944
builtin_function_or_method 780
method_descriptor 780
getset_descriptor 477
type 431
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 17.2 MiB 17.2 MiB @profile
6 def open_stuff():
7 17.2 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 198.3 MiB 181.1 MiB d= jsonfile.read()
9 1057.7 MiB 859.4 MiB jsonobj = json.loads(d)
10 1058.1 MiB 0.4 MiB objgraph.show_most_common_types()
11 537.5 MiB -520.6 MiB del jsonobj
12 356.5 MiB -181.0 MiB del d
13 356.5 MiB 0.0 MiB print ('d')
14 355.8 MiB -0.8 MiB gc.collect()
What is causing this issue? Both environments are running 64-bit versions of Python.
EDIT: Calling that function several times in a row leads to even stranger data. The json.loads call uses less RAM each time it's called, and after the third run the RAM usage stabilizes, but the earlier "leaked" RAM is never released.
list 3039189
dict 413840
function 2339
wrapper_descriptor 1193
builtin_function_or_method 765
method_descriptor 651
tuple 517
weakref 480
property 273
member_descriptor 250
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 14.5 MiB 14.5 MiB @profile
6 def open_stuff():
7 14.5 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 195.4 MiB 180.9 MiB d= jsonfile.read()
9 1466.5 MiB 1271.1 MiB jsonobj = json.loads(d)
10 1466.9 MiB 0.4 MiB objgraph.show_most_common_types()
11 694.8 MiB -772.1 MiB del jsonobj
12 513.9 MiB -181.0 MiB del d
13 513.9 MiB 0.0 MiB print ('d')
14 513.1 MiB -0.8 MiB gc.collect()
list 3039189
dict 413842
function 2339
wrapper_descriptor 1202
builtin_function_or_method 765
method_descriptor 651
tuple 517
weakref 482
property 273
member_descriptor 253
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 513.1 MiB 513.1 MiB @profile
6 def open_stuff():
7 513.1 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 513.1 MiB 0.0 MiB d= jsonfile.read()
9 1466.8 MiB 953.7 MiB jsonobj = json.loads(d)
10 1493.3 MiB 26.6 MiB objgraph.show_most_common_types()
11 723.9 MiB -769.4 MiB del jsonobj
12 723.9 MiB 0.0 MiB del d
13 723.9 MiB 0.0 MiB print ('d')
14 722.4 MiB -1.5 MiB gc.collect()
list 3039189
dict 413842
function 2339
wrapper_descriptor 1202
builtin_function_or_method 765
method_descriptor 651
tuple 517
weakref 482
property 273
member_descriptor 253
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 722.4 MiB 722.4 MiB @profile
6 def open_stuff():
7 722.4 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 722.4 MiB 0.0 MiB d= jsonfile.read()
9 1493.1 MiB 770.8 MiB jsonobj = json.loads(d)
10 1493.4 MiB 0.3 MiB objgraph.show_most_common_types()
11 724.4 MiB -769.0 MiB del jsonobj
12 724.4 MiB 0.0 MiB del d
13 724.4 MiB 0.0 MiB print ('d')
14 722.9 MiB -1.5 MiB gc.collect()
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
17 14.2 MiB 14.2 MiB @profile
18 def wow():
19 513.1 MiB 498.9 MiB open_stuff()
20 722.4 MiB 209.3 MiB open_stuff()
21 722.9 MiB 0.6 MiB open_stuff()
EDIT 2: Someone suggested this is a duplicate of Why does my program's memory not release?, but the amount of memory in question is far from the "small pages" discussed in the other question.
While Python frees memory back to glibc, glibc will not release it back to the OS immediately every time, since the process may request memory again later. You can call glibc's malloc_trim(3) to attempt to release that memory back to the OS:
import ctypes
import json
from memory_profiler import profile

def malloc_trim():
    # Ask glibc to return free heap memory to the OS.
    ctypes.CDLL('libc.so.6').malloc_trim(0)

@profile
def load():
    with open('big.json') as f:
        d = json.load(f)
    del d
    malloc_trim()
result:
Line # Mem usage Increment Line Contents
================================================
27 11.6 MiB 11.6 MiB @profile
28 def load():
29 11.6 MiB 0.0 MiB with open('big.json') as f:
30 166.5 MiB 154.9 MiB d = json.load(f)
31 44.1 MiB -122.4 MiB del d
32 12.7 MiB -31.4 MiB malloc_trim()
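If you want to sanity-check the effect without memory_profiler, a minimal Linux-only sketch (the current_rss_kb helper is mine, not part of any library) is to read VmRSS from /proc/self/status around the trim:

import ctypes

def current_rss_kb():
    # Linux-only: parse the resident set size out of /proc/self/status.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # value is reported in kB

def malloc_trim():
    # Ask glibc to hand free heap pages back to the kernel.
    ctypes.CDLL('libc.so.6').malloc_trim(0)

print('RSS before trim: %s kB' % current_rss_kb())
malloc_trim()
print('RSS after trim:  %s kB' % current_rss_kb())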
The linked duplicate likely hints at what your problem is, but let's go into a bit more detail.
First, you should use json.load instead of reading the file entirely into memory and then calling json.loads on that:
with open('bigjson.json') as f:
    data = json.load(f)
This allows the decoder to consume the file as it needs it and will most likely reduce memory usage. In your original version, the entire file had to be held in memory before you could even start parsing the JSON.
Also, I see you're using Python 2.7. Any particular reason why? dicts have seen a lot of updates in Python 3, in particular ones that drastically reduce memory usage. Perhaps consider benchmarking against Python 3 as well if memory usage is this much of a problem.
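As a rough, hedged illustration (the sample dict below is made up, not the question's data), running sys.getsizeof under each interpreter shows how the container overhead of a dict has shrunk in modern Python 3:

import sys

# Container-only measurement; keys and values are counted separately.
sample = {'key%d' % i: i for i in range(10)}
print('%s: %d bytes for a 10-item dict'
      % (sys.version.split()[0], sys.getsizeof(sample)))

Run it once under 2.7 and once under 3.6+ to compare the numbers on your own machine.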
The problem you're running into here isn't that the memory isn't being released.
The "mem usage" column likely is an indication of the program's RSS (which roughly is the amount of memory available to a process without needing to ask the OS for more space). The README for memory_profiler
don't seem to indicate this precisely, but they make some vague statements that would suggest this: "the second column (Mem usage) the memory usage of the Python interpreter after that line has been executed."
Assuming this, we see that in all OSes, after the JSON dict is reclaimed, the RSS of the program is halved (suspicious, isn't it? We'll get to that a bit later). That's because there are a lot of layers here. Roughly, we have:
Your code -> Python Runtime/GC -> userland allocator -> (syscall) -> Operating System -> Physical RAM
When something goes out of scope it can be released from the perspective of your code. The Python GC makes no guarantees as to when this happens, but if you call gc.collect() and the objects are out of scope (have a refcount of 0), then they should indeed be released by the Python runtime. But that only gives memory back to the userland allocator, which may or may not give it back to the OS. We see it do that after we reclaim jsonobj in all OSes, but instead of giving back everything it only about halves the memory usage. That should raise a red flag, since that magic halved number appeared nowhere else. This is a good indication that the userland allocator is doing some work here.
Recalling some basic data structures, a vector (a dynamically sized, growable and shrinkable array) typically starts as a NULL pointer. Then, as you append elements to it, it grows. We usually grow vectors by doubling their size, because this gives desirable amortized performance: inserts on average take constant time regardless of the final length of the vector. (The same goes for deletes, which may shrink the vector by a factor of 2.)
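You can watch CPython's own list over-allocate in the same spirit (a small sketch unrelated to the question's data; CPython's growth factor is smaller than 2, but the amortized idea is the same):

import sys

# Print only the appends where the list's allocated size actually jumps.
items = []
last_size = sys.getsizeof(items)
for i in range(32):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:
        print('len=%2d -> %d bytes allocated' % (len(items), size))
        last_size = size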
The memory allocator underneath Python's GC is likely employing an approach similar to this. Instead of reclaiming all of the memory used, it is taking a guess that later you may need at least half of that again. If you don't, then yes it is indeed retaining too much (but not leaking). But if you do (and memory usage for things like webservers is often bursty like this), then this guess saves you allocation time in the future (which at this level is a syscall).
In your benchmark where you run the code multiple times, you see this behavior. It keeps enough memory around so that the initial jsonfile.read() can fit into memory without needing to ask for more. If there were a bug somewhere (an actual memory leak), you'd see the memory usage trending upwards over time. I don't think your data looks like that. For example, see the graph from another featured Python question; that is what a memory leak looks like.
If you want to be doubly sure, you could run your script with valgrind. This will confirm for you whether there is a memory leak anywhere in userland. But, I suspect this is not the case.
Edit: as an aside, if you're dealing with files this large, perhaps JSON isn't the proper format to store them in. Something you can stream is likely to be much more memory friendly (Python generators are great for this). If the JSON format is unavoidable and this memory usage is indeed a problem, you'll likely want to work in a language that gives you more fine-grained control over memory layout and allocations, like C, C++, or Rust. A fine-tuned C struct representing your data can likely do a much better job of packing it than a Python dict (especially a 2.7 dict). Furthermore, if you're doing this operation frequently, you could mmap the file (perhaps dump a wire format to the file so that when it's mmapped in you can read directly from it). Or load it once and let the OS deal with it: high memory usage isn't a problem, because most OSes are very good at paging out memory when it is less frequently accessed.
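For the mmap route, here is a minimal sketch using the question's filename; this only covers the mapping itself, and you'd still need a format you can read in place to really benefit:

import mmap

# Map the file read-only; pages are loaded lazily and can be evicted by the OS.
with open('bigjson.json', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        print(mm[:64])  # read a slice without copying the whole file onto the heap
    finally:
        mm.close()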