Python memory not being released on Linux?

I am trying to load a large JSON object into memory and then perform some operations with the data. However, I am noticing a large increase in RAM usage after the JSON file is read, even after the object is out of scope.

Here is the code:

import json
import objgraph
import gc
from memory_profiler import profile
@profile
def open_stuff():
    with open("bigjson.json", 'r') as jsonfile:
        d= jsonfile.read()
        jsonobj = json.loads(d)
        objgraph.show_most_common_types()
        del jsonobj
        del d
    print ('d')
    gc.collect()

open_stuff()

I tried running this script on Windows with Python 2.7.12 and on Debian 9 with Python 2.7.13, and I am seeing an issue with the Python build on Linux.

On Windows, the script uses up a lot of RAM while the JSON object is being read and in scope (as expected), and the memory is released after the operation is done (also as expected).

list                       3039184
dict                       413840
function                   2200
wrapper_descriptor         1199
builtin_function_or_method 819
method_descriptor          651
tuple                      617
weakref                    554
getset_descriptor          362
member_descriptor          250
d
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     16.9 MiB     16.9 MiB   @profile
     6                             def open_stuff():
     7     16.9 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    197.9 MiB    181.0 MiB           d= jsonfile.read()
     9   1393.4 MiB   1195.5 MiB           jsonobj = json.loads(d)
    10   1397.0 MiB      3.6 MiB           objgraph.show_most_common_types()
    11    402.8 MiB   -994.2 MiB           del jsonobj
    12    221.8 MiB   -181.0 MiB           del d
    13    221.8 MiB      0.0 MiB       print ('d')
    14     23.3 MiB   -198.5 MiB       gc.collect()

However, in the Linux environment, over 500 MB of RAM is still in use even though all references to the JSON object have been deleted.

list                       3039186
dict                       413836
function                   2336
wrapper_descriptor         1193
builtin_function_or_method 765
method_descriptor          651
tuple                      514
weakref                    480
property                   273
member_descriptor          250
d
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     14.2 MiB     14.2 MiB   @profile
     6                             def open_stuff():
     7     14.2 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    195.1 MiB    181.0 MiB           d= jsonfile.read()
     9   1466.4 MiB   1271.3 MiB           jsonobj = json.loads(d)
    10   1466.8 MiB      0.4 MiB           objgraph.show_most_common_types()
    11    694.8 MiB   -772.1 MiB           del jsonobj
    12    513.8 MiB   -181.0 MiB           del d
    13    513.8 MiB      0.0 MiB       print ('d')
    14    513.0 MiB     -0.8 MiB       gc.collect()

The same script run on Debian 9 with Python 3.5.3 uses less RAM overall, but appears to leak a proportionally similar amount.

list                       3039266
dict                       414638
function                   3374
tuple                      1254
wrapper_descriptor         1076
weakref                    944
builtin_function_or_method 780
method_descriptor          780
getset_descriptor          477
type                       431
d
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     17.2 MiB     17.2 MiB   @profile
     6                             def open_stuff():
     7     17.2 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    198.3 MiB    181.1 MiB           d= jsonfile.read()
     9   1057.7 MiB    859.4 MiB           jsonobj = json.loads(d)
    10   1058.1 MiB      0.4 MiB           objgraph.show_most_common_types()
    11    537.5 MiB   -520.6 MiB           del jsonobj
    12    356.5 MiB   -181.0 MiB           del d
    13    356.5 MiB      0.0 MiB       print ('d')
    14    355.8 MiB     -0.8 MiB       gc.collect()

What is causing this issue? Both environments are running 64-bit builds of Python.

EDIT: calling the function several times in a row leads to even stranger data: json.loads uses less RAM each time it is called, and after the third call the RAM usage stabilizes, but the RAM retained earlier is never released.

list                       3039189
dict                       413840
function                   2339
wrapper_descriptor         1193
builtin_function_or_method 765
method_descriptor          651
tuple                      517
weakref                    480
property                   273
member_descriptor          250
d
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     14.5 MiB     14.5 MiB   @profile
     6                             def open_stuff():
     7     14.5 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    195.4 MiB    180.9 MiB           d= jsonfile.read()
     9   1466.5 MiB   1271.1 MiB           jsonobj = json.loads(d)
    10   1466.9 MiB      0.4 MiB           objgraph.show_most_common_types()
    11    694.8 MiB   -772.1 MiB           del jsonobj
    12    513.9 MiB   -181.0 MiB           del d
    13    513.9 MiB      0.0 MiB       print ('d')
    14    513.1 MiB     -0.8 MiB       gc.collect()


list                       3039189
dict                       413842
function                   2339
wrapper_descriptor         1202
builtin_function_or_method 765
method_descriptor          651
tuple                      517
weakref                    482
property                   273
member_descriptor          253
d
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
     5    513.1 MiB    513.1 MiB   @profile
     6                             def open_stuff():
     7    513.1 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    513.1 MiB      0.0 MiB           d= jsonfile.read()
     9   1466.8 MiB    953.7 MiB           jsonobj = json.loads(d)
    10   1493.3 MiB     26.6 MiB           objgraph.show_most_common_types()
    11    723.9 MiB   -769.4 MiB           del jsonobj
    12    723.9 MiB      0.0 MiB           del d
    13    723.9 MiB      0.0 MiB       print ('d')
    14    722.4 MiB     -1.5 MiB       gc.collect()


list                       3039189
dict                       413842
function                   2339
wrapper_descriptor         1202
builtin_function_or_method 765
method_descriptor          651
tuple                      517
weakref                    482
property                   273
member_descriptor          253
d
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
     5    722.4 MiB    722.4 MiB   @profile
     6                             def open_stuff():
     7    722.4 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    722.4 MiB      0.0 MiB           d= jsonfile.read()
     9   1493.1 MiB    770.8 MiB           jsonobj = json.loads(d)
    10   1493.4 MiB      0.3 MiB           objgraph.show_most_common_types()
    11    724.4 MiB   -769.0 MiB           del jsonobj
    12    724.4 MiB      0.0 MiB           del d
    13    724.4 MiB      0.0 MiB       print ('d')
    14    722.9 MiB     -1.5 MiB       gc.collect()


Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
================================================
    17     14.2 MiB     14.2 MiB   @profile
    18                             def wow():
    19    513.1 MiB    498.9 MiB       open_stuff()
    20    722.4 MiB    209.3 MiB       open_stuff()
    21    722.9 MiB      0.6 MiB       open_stuff()

EDIT 2: Someone suggested this is a duplicate of "Why does my program's memory not release?", but the amount of memory in question is far larger than the "small pages" discussed in that question.

asked Aug 20 '18 by lonewaft



2 Answers

While Python frees memory back to glibc, glibc does not return it to the OS immediately every time, since the process may request memory again later. You can call glibc's malloc_trim(3) to attempt to release the memory back to the OS:

import ctypes
import json

from memory_profiler import profile

def malloc_trim():
    # ask glibc to release free heap pages back to the OS
    ctypes.CDLL('libc.so.6').malloc_trim(0)

@profile
def load():
    with open('big.json') as f:
        d = json.load(f)
    del d
    malloc_trim()

result:

Line #    Mem usage    Increment   Line Contents
================================================
    27     11.6 MiB     11.6 MiB   @profile
    28                             def load():
    29     11.6 MiB      0.0 MiB       with open('big.json') as f:
    30    166.5 MiB    154.9 MiB           d = json.load(f)
    31     44.1 MiB   -122.4 MiB       del d
    32     12.7 MiB    -31.4 MiB       malloc_trim()
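
If portability matters, a slightly more defensive variant of the helper above (a sketch; it assumes a glibc-based system and simply does nothing elsewhere) can look libc up by name and check that malloc_trim is actually exported:

import ctypes
import ctypes.util

def malloc_trim():
    # find_library('c') may return None on platforms without a locatable libc
    libc_name = ctypes.util.find_library('c')
    if libc_name is None:
        return
    libc = ctypes.CDLL(libc_name)
    # malloc_trim is a glibc extension; other allocators may not export it
    if hasattr(libc, 'malloc_trim'):
        libc.malloc_trim(0)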

answered by georgexsh


The linked duplicate likely hints at what your problem is, but let's go into a bit more detail.

First, you should use json.load instead of loading the file entirely into memory and then doing json.loads on that:

with open('bigjson.json') as f:
    data = json.load(f)

This lets the decoder consume the file at its own pace and will most likely reduce memory usage. In your original version, you had to hold the entire file contents in memory before you could even start parsing the JSON; this way, the file can be read as the decoder needs it.

Also, I see you're using Python 2.7. Any particular reason why? dicts have seen a lot of updates in Python 3, in particular ones that drastically reduce memory usage. Consider benchmarking against Python 3 as well if memory usage is this much of a problem.
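
As a rough illustration of the dict changes (a minimal sketch; it only measures the dict container itself, not the keys and values it holds), you can run the same snippet under 2.7 and 3.6+ and compare sys.getsizeof:

import sys

# one million small entries; only the hash-table container is measured
d = dict((i, str(i)) for i in range(1000000))
print("dict container: %.1f MiB" % (sys.getsizeof(d) / 1024.0 / 1024.0))

Running this under both interpreters gives a feel for how much of the difference comes from the dict layout alone.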


The problem you're running into here isn't that the memory isn't being released.

The "mem usage" column likely is an indication of the program's RSS (which roughly is the amount of memory available to a process without needing to ask the OS for more space). The README for memory_profiler don't seem to indicate this precisely, but they make some vague statements that would suggest this: "the second column (Mem usage) the memory usage of the Python interpreter after that line has been executed."

Assuming this, we see that in all OSes, after the json dict is reclaimed the RSS of the program is halved (suspicious, isn't it? we'll get to that a bit later). That's because there are a lot of layers here. Roughly, we have:

Your code -> Python Runtime/GC -> userland allocator -> (syscall) -> Operating System -> Physical RAM

When something goes out of scope it can be released from the perspective of your code. The Python GC makes no guarantees as to when this happens, but if you call gc.collect() and the objects are out of scope (have 0 refcount), then they should indeed be released by the Python runtime. But, that gives memory back to the userland allocator. This may or may not give memory back to the OS. We see it do that after we reclaim jsonobj in all OSes. But instead of giving back everything it just about halves the memory usage. That should raise a red flag since that magic halved number appeared nowhere else. This is a good indication that the userland allocator is doing some work here.
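
A tiny sketch of that first hand-off, from your code to the Python runtime (the variable is just for illustration):

import gc
import sys

payload = [list(range(100)) for _ in range(10000)]
print(sys.getrefcount(payload))  # includes the temporary reference created by the call itself
del payload                      # refcount hits zero; CPython returns the memory to its allocator
print(gc.collect())              # the cycle collector finds little extra to do here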

Recalling some basic data structures: a vector (a dynamically sized, growable and shrinkable array) typically starts as a NULL pointer. Then, as you append elements to it, it grows. Vectors are usually grown by doubling their size, because this gives desirable amortized performance: inserts take constant time on average regardless of the final length of the vector (and the same holds for deletes, which may shrink the vector by a factor of 2).
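
You can watch CPython's list doing something in this spirit (its growth factor is closer to 1.125 than 2, but the amortized idea is the same) with a quick sketch:

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(2000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # the container grows in bursts, not on every append
        print("len=%4d  container bytes=%6d" % (len(lst), size))
        last_size = size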

The memory allocator underneath Python's GC is likely employing an approach similar to this. Instead of reclaiming all of the memory used, it is taking a guess that later you may need at least half of that again. If you don't, then yes it is indeed retaining too much (but not leaking). But if you do (and memory usage for things like webservers is often bursty like this), then this guess saves you allocation time in the future (which at this level is a syscall).

In your benchmark where you run the code multiple times, you see this behavior. The allocator keeps enough memory around that the initial jsonfile.read() can fit into it without needing to ask the OS for more. If there were a bug somewhere (an actual memory leak), you'd see memory usage trending upwards over time. I don't think your data looks like that. For example, see the graph from another featured Python question: that is what a memory leak looks like.
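
If you want to watch that trend directly, a small Linux-only sketch (the filename and pass count are placeholders) is to call the loader in a loop and print the RSS after each pass:

import json
import resource

def rss_mib():
    # Linux: /proc/self/statm reports sizes in pages; the second field is the resident set
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * resource.getpagesize() / (1024.0 * 1024.0)

for i in range(5):
    with open("bigjson.json") as f:
        data = json.load(f)
    del data
    print("pass %d: RSS %.1f MiB" % (i, rss_mib()))

A genuine leak shows up as a number that keeps climbing; the allocator behavior described above shows up as a plateau after the first pass or two.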

If you want to be doubly sure, you could run your script with valgrind. This will confirm for you whether there is a memory leak anywhere in userland. But, I suspect this is not the case.

edit: as an aside, if you're dealing with files this large, perhaps JSON isn't the right format to store them in. Something you can stream is likely to be much more memory friendly (Python generators are great for this). If the JSON format is unavoidable and this memory usage is indeed a problem, you'll likely want to work in a language that gives you more fine-grained control over memory layout and allocation, such as C, C++, or Rust. A finely tuned C struct representing your data can likely pack it much better than a Python dict (especially a 2.7 dict). Furthermore, if you do this operation frequently, you could mmap the file (perhaps dump a wire format to the file so that, once mmapped, you can read from it directly). Or load it once and let the OS deal with it. High memory usage isn't a problem here, because most OSes are very good at paging out memory that is accessed infrequently.
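
For the mmap idea, a minimal Python sketch looks like this (the filename and the slice are placeholders; only the pages you actually touch become resident):

import mmap

with open("bigjson.json", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # read a small window; the OS pages data in (and back out) on demand
        head = mm[:64]
        print(head)
    finally:
        mm.close()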

answered by Bailey Parker