I am trying to load a large JSON object into memory and then perform some operations with the data. However, I am noticing a large increase in RAM usage after the JSON file is read, even after the object is out of scope.
Here is the code:
import json
import objgraph
import gc
from memory_profiler import profile

@profile
def open_stuff():
    with open("bigjson.json", 'r') as jsonfile:
        d = jsonfile.read()
        jsonobj = json.loads(d)
        objgraph.show_most_common_types()
        del jsonobj
        del d
        print('d')
        gc.collect()

open_stuff()
I tried running this script on Windows with Python 2.7.12 and on Debian 9 with Python 2.7.13, and I am seeing an issue with Python on Linux.
In Windows, when I run the script, it uses up a lot of RAM while the json object is being read and in scope (as expected), but it is released after the operation is done (as expected).
list 3039184
dict 413840
function 2200
wrapper_descriptor 1199
builtin_function_or_method 819
method_descriptor 651
tuple 617
weakref 554
getset_descriptor 362
member_descriptor 250
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 16.9 MiB 16.9 MiB @profile
6 def open_stuff():
7 16.9 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 197.9 MiB 181.0 MiB d= jsonfile.read()
9 1393.4 MiB 1195.5 MiB jsonobj = json.loads(d)
10 1397.0 MiB 3.6 MiB objgraph.show_most_common_types()
11 402.8 MiB -994.2 MiB del jsonobj
12 221.8 MiB -181.0 MiB del d
13 221.8 MiB 0.0 MiB print ('d')
14 23.3 MiB -198.5 MiB gc.collect()
However, in the Linux environment, over 500 MB of RAM is still used even though all references to the JSON object have been deleted.
list 3039186
dict 413836
function 2336
wrapper_descriptor 1193
builtin_function_or_method 765
method_descriptor 651
tuple 514
weakref 480
property 273
member_descriptor 250
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 14.2 MiB 14.2 MiB @profile
6 def open_stuff():
7 14.2 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 195.1 MiB 181.0 MiB d= jsonfile.read()
9 1466.4 MiB 1271.3 MiB jsonobj = json.loads(d)
10 1466.8 MiB 0.4 MiB objgraph.show_most_common_types()
11 694.8 MiB -772.1 MiB del jsonobj
12 513.8 MiB -181.0 MiB del d
13 513.8 MiB 0.0 MiB print ('d')
14 513.0 MiB -0.8 MiB gc.collect()
The same script, run on Debian 9 with Python 3.5.3, uses less RAM but leaks a proportionate amount of RAM.
list 3039266
dict 414638
function 3374
tuple 1254
wrapper_descriptor 1076
weakref 944
builtin_function_or_method 780
method_descriptor 780
getset_descriptor 477
type 431
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 17.2 MiB 17.2 MiB @profile
6 def open_stuff():
7 17.2 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 198.3 MiB 181.1 MiB d= jsonfile.read()
9 1057.7 MiB 859.4 MiB jsonobj = json.loads(d)
10 1058.1 MiB 0.4 MiB objgraph.show_most_common_types()
11 537.5 MiB -520.6 MiB del jsonobj
12 356.5 MiB -181.0 MiB del d
13 356.5 MiB 0.0 MiB print ('d')
14 355.8 MiB -0.8 MiB gc.collect()
What is causing this issue? Both environments are running 64-bit versions of Python.
EDIT: Calling that function several times in a row leads to even stranger data. The json.loads call uses less RAM each time it's called, and after the third run the RAM usage stabilizes, but the earlier "leaked" RAM is never released.
list 3039189
dict 413840
function 2339
wrapper_descriptor 1193
builtin_function_or_method 765
method_descriptor 651
tuple 517
weakref 480
property 273
member_descriptor 250
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 14.5 MiB 14.5 MiB @profile
6 def open_stuff():
7 14.5 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 195.4 MiB 180.9 MiB d= jsonfile.read()
9 1466.5 MiB 1271.1 MiB jsonobj = json.loads(d)
10 1466.9 MiB 0.4 MiB objgraph.show_most_common_types()
11 694.8 MiB -772.1 MiB del jsonobj
12 513.9 MiB -181.0 MiB del d
13 513.9 MiB 0.0 MiB print ('d')
14 513.1 MiB -0.8 MiB gc.collect()
list 3039189
dict 413842
function 2339
wrapper_descriptor 1202
builtin_function_or_method 765
method_descriptor 651
tuple 517
weakref 482
property 273
member_descriptor 253
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 513.1 MiB 513.1 MiB @profile
6 def open_stuff():
7 513.1 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 513.1 MiB 0.0 MiB d= jsonfile.read()
9 1466.8 MiB 953.7 MiB jsonobj = json.loads(d)
10 1493.3 MiB 26.6 MiB objgraph.show_most_common_types()
11 723.9 MiB -769.4 MiB del jsonobj
12 723.9 MiB 0.0 MiB del d
13 723.9 MiB 0.0 MiB print ('d')
14 722.4 MiB -1.5 MiB gc.collect()
list 3039189
dict 413842
function 2339
wrapper_descriptor 1202
builtin_function_or_method 765
method_descriptor 651
tuple 517
weakref 482
property 273
member_descriptor 253
d
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
5 722.4 MiB 722.4 MiB @profile
6 def open_stuff():
7 722.4 MiB 0.0 MiB with open("bigjson.json", 'r') as jsonfile:
8 722.4 MiB 0.0 MiB d= jsonfile.read()
9 1493.1 MiB 770.8 MiB jsonobj = json.loads(d)
10 1493.4 MiB 0.3 MiB objgraph.show_most_common_types()
11 724.4 MiB -769.0 MiB del jsonobj
12 724.4 MiB 0.0 MiB del d
13 724.4 MiB 0.0 MiB print ('d')
14 722.9 MiB -1.5 MiB gc.collect()
Filename: testjson.py
Line # Mem usage Increment Line Contents
================================================
17 14.2 MiB 14.2 MiB @profile
18 def wow():
19 513.1 MiB 498.9 MiB open_stuff()
20 722.4 MiB 209.3 MiB open_stuff()
21 722.9 MiB 0.6 MiB open_stuff()
EDIT 2: Someone suggested this is a duplicate of Why does my program's memory not release?, but the amount of memory in question is far from the "small pages" discussed in the other question.
While Python frees memory back to glibc, glibc will not release it back to the OS immediately every time, since the process may request memory again later. You can call glibc's malloc_trim(3) to attempt to release that memory back to the OS:
import ctypes
import json
from memory_profiler import profile

def malloc_trim():
    # Ask glibc to return free heap memory to the OS.
    ctypes.CDLL('libc.so.6').malloc_trim(0)

@profile
def load():
    with open('big.json') as f:
        d = json.load(f)
    del d
    malloc_trim()
result:
Line # Mem usage Increment Line Contents
================================================
27 11.6 MiB 11.6 MiB @profile
28 def load():
29 11.6 MiB 0.0 MiB with open('big.json') as f:
30 166.5 MiB 154.9 MiB d = json.load(f)
31 44.1 MiB -122.4 MiB del d
32 12.7 MiB -31.4 MiB malloc_trim()
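If you want to sanity-check the effect without memory_profiler, a minimal Linux-only sketch (the current_rss_kb helper is mine, not part of any library) is to read VmRSS from /proc/self/status around the trim:

import ctypes

def current_rss_kb():
    # Linux-only: parse the resident set size out of /proc/self/status.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # value is reported in kB

def malloc_trim():
    # Ask glibc to hand free heap pages back to the kernel.
    ctypes.CDLL('libc.so.6').malloc_trim(0)

print('RSS before trim: %s kB' % current_rss_kb())
malloc_trim()
print('RSS after trim:  %s kB' % current_rss_kb())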
The linked duplicate likely hints at what your problem is, but let's go into a bit more detail.
First, you should use json.load instead of reading the file entirely into memory and then calling json.loads on that:
with open('bigjson.json') as f:
    data = json.load(f)
This allows the decoder to consume the file as it needs it and will most likely reduce memory usage. In your original version, the entire file had to be held in memory before you could even start parsing the JSON.
Also, I see you're using Python 2.7. Any particular reason why? dicts have seen a lot of updates in Python 3, in particular ones that drastically reduce memory usage. Perhaps consider benchmarking against Python 3 as well if memory usage is this much of a problem.
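As a rough, hedged illustration (the sample dict below is made up, not the question's data), running sys.getsizeof under each interpreter shows how the container overhead of a dict has shrunk in modern Python 3:

import sys

# Container-only measurement; keys and values are counted separately.
sample = {'key%d' % i: i for i in range(10)}
print('%s: %d bytes for a 10-item dict'
      % (sys.version.split()[0], sys.getsizeof(sample)))

Run it once under 2.7 and once under 3.6+ to compare the numbers on your own machine.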
The problem you're running into here isn't that the memory isn't being released.
The "mem usage" column likely is an indication of the program's RSS (which roughly is the amount of memory available to a process without needing to ask the OS for more space). The README for memory_profiler
don't seem to indicate this precisely, but they make some vague statements that would suggest this: "the second column (Mem usage) the memory usage of the Python interpreter after that line has been executed."
Assuming this, we see that in all OSes, after the JSON dict is reclaimed, the RSS of the program is halved (suspicious, isn't it? We'll get to that a bit later). That's because there are a lot of layers here. Roughly, we have:
Your code -> Python Runtime/GC -> userland allocator -> (syscall) -> Operating System -> Physical RAM
When something goes out of scope it can be released from the perspective of your code. The Python GC makes no guarantees as to when this happens, but if you call gc.collect() and the objects are out of scope (have a refcount of 0), then they should indeed be released by the Python runtime. But that only gives memory back to the userland allocator, which may or may not give it back to the OS. We see it do that after we reclaim jsonobj in all OSes, but instead of giving back everything it only about halves the memory usage. That should raise a red flag, since that magic halved number appeared nowhere else. This is a good indication that the userland allocator is doing some work here.
Recalling some basic data structures, a vector (a dynamically sized, growable and shrinkable array) typically starts as a NULL pointer. Then, as you append elements to it, it grows. We usually grow vectors by doubling their size, because this gives desirable amortized performance: inserts on average take constant time regardless of the final length of the vector. (The same goes for deletes, which may shrink the vector by a factor of 2.)
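You can watch CPython's own list over-allocate in the same spirit (a small sketch unrelated to the question's data; CPython's growth factor is smaller than 2, but the amortized idea is the same):

import sys

# Print only the appends where the list's allocated size actually jumps.
items = []
last_size = sys.getsizeof(items)
for i in range(32):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:
        print('len=%2d -> %d bytes allocated' % (len(items), size))
        last_size = size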
The memory allocator underneath Python's GC is likely employing an approach similar to this. Instead of reclaiming all of the memory used, it is taking a guess that later you may need at least half of that again. If you don't, then yes it is indeed retaining too much (but not leaking). But if you do (and memory usage for things like webservers is often bursty like this), then this guess saves you allocation time in the future (which at this level is a syscall).
In your benchmark where you run the code multiple times, you see this behavior. It keeps enough memory around so that the initial jsonfile.read() can fit into memory without needing to ask for more. If there were a bug somewhere (an actual memory leak), you'd see the memory usage trending upwards over time. I don't think your data looks like that. For example, see the graph from another featured Python question; that is what a memory leak looks like.
If you want to be doubly sure, you could run your script with valgrind. This will confirm for you whether there is a memory leak anywhere in userland. But, I suspect this is not the case.
Edit: as an aside, if you're dealing with files this large, perhaps JSON isn't the proper format to store them in. Something you can stream is likely to be much more memory friendly (Python generators are great for this). If the JSON format is unavoidable and this memory usage is indeed a problem, you'll likely want to work in a language that gives you more fine-grained control over memory layout and allocations, like C, C++, or Rust. A fine-tuned C struct representing your data can likely do a much better job of packing it than a Python dict (especially a 2.7 dict). Furthermore, if you're doing this operation frequently, you could mmap the file (perhaps dump a wire format to the file so that when it's mmapped in you can read directly from it). Or load it once and let the OS deal with it: high memory usage isn't a problem, because most OSes are very good at paging out memory when it is less frequently accessed.
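For the mmap route, here is a minimal sketch using the question's filename; this only covers the mapping itself, and you'd still need a format you can read in place to really benefit:

import mmap

# Map the file read-only; pages are loaded lazily and can be evicted by the OS.
with open('bigjson.json', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        print(mm[:64])  # read a slice without copying the whole file onto the heap
    finally:
        mm.close()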