I have a Python 3.6 data processing task that involves pre-loading a large dict for looking up dates by ID for use in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most if not all of the memory on the box, so one optimisation I applied was to 'intern' the string dates being stored in the dict. This reduced the memory footprint of the dict by several GBs as I expected it would, but it also had another unexpected effect.
Before applying interning, the sub-processes would gradually eat more and more memory as they executed, which I believe was down to them having to copy the dict gradually from global memory across to the sub-processes' individual allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.
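As a small illustration of what I mean by reference counting writing to memory: even a purely 'read-only' operation such as binding a dict value to a new name increments the object's reference count, and that count lives in the object header, i.e. on the same memory page as the object itself (the string value below is just an example of the kind of date stored in my dict):

import sys

value = "2021-05-15"              # example of the kind of string stored in the dict
print(sys.getrefcount(value))     # some baseline count

x = value                         # a 'read-only' lookup of the value...
print(sys.getrefcount(value))     # ...still bumps the count by one, which is a write
                                  # to the page that holds the object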
I was only expecting the interning to reduce the memory footprint of the dict, but in fact it also stopped the memory usage from gradually increasing over the sub-processes' lifetimes.
Here's a minimal example I was able to build that replicates the behaviour, although it requires a large file to load and populate the dict from, and enough repetition in the values for interning to provide a benefit.
import multiprocessing
import sys

# initialise a large dict that will be visible to all processes
# that contains a lot of repeated values
global_map = dict()

with open(sys.argv[1], 'r', encoding='utf-8') as file:
    if len(sys.argv) > 2:
        print('interning is on')
    else:
        print('interning is off')
    for i, line in enumerate(file):
        if i > 30000000:
            break
        parts = line.split('|')
        if len(sys.argv) > 2:
            global_map[str(i)] = sys.intern(parts[2])
        else:
            global_map[str(i)] = parts[2]

def read_map():
    # do some nonsense processing with each value in the dict
    global global_map
    for i in range(30000000):
        x = global_map[str(i)]
        y = x + '_'
    return y

print("starting processes")
process_pool = multiprocessing.Pool(processes=10)
for _ in range(10):
    process_pool.apply_async(read_map)
process_pool.close()
process_pool.join()
I ran this script and monitored htop to see the total memory usage.

| interning? | mem usage just after 'starting processes' printed | peak mem usage after that |
|---|---|---|
| no | 7.1GB | 28.0GB |
| yes | 5.5GB | 5.6GB |
While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why this works. If the creeping memory usage by the sub-processes is down to copy-on-write, why doesn't this happen if I intern the strings?
Shared memory: the multiprocessing module provides Array and Value objects to share data between processes. Array is a ctypes array allocated from shared memory, and Value is a ctypes object allocated from shared memory.

Shared memory can be a very efficient way of handling data in a program that uses concurrency, and Python's mmap module likewise uses memory mapping to share large amounts of data efficiently between multiple Python processes, threads and tasks that are running concurrently.
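As a minimal sketch of how Value and Array are used (this follows the pattern from the multiprocessing documentation; the worker function and the 'd'/'i' typecodes are just illustrative choices):

from multiprocessing import Process, Value, Array

def worker(num, arr):
    # both objects live in shared memory, so these writes are
    # visible to the parent process without any copying or pickling
    num.value = 3.1415927
    for i in range(len(arr)):
        arr[i] = -arr[i]

if __name__ == '__main__':
    num = Value('d', 0.0)          # a shared double
    arr = Array('i', range(10))    # a shared array of ints
    p = Process(target=worker, args=(num, arr))
    p.start()
    p.join()
    print(num.value)               # 3.1415927
    print(arr[:])                  # [0, -1, -2, ..., -9]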
The CPython implementation stores interned strings in a global object that is a regular Python dictionary, in which both keys and values are pointers to the string objects.
When a new child process is created, it gets a copy of the parent's address space, so it will use the reduced data dictionary with interned strings.
I've compiled Python with the patch below and as you can see, both processes have access to the table with interned strings:
test.py:
import multiprocessing as mp
import sys
import _string

PROCS = 2
STRING = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

def worker():
    proc = mp.current_process()
    interned = _string.interned()
    try:
        idx = interned.index(STRING)
    except ValueError:
        s = None
    else:
        s = interned[idx]
    print(f"{proc}: <{s}>")

def main():
    sys.intern(STRING)
    procs = []
    for _ in range(PROCS):
        p = mp.Process(target=worker)
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
Test:
# python test.py
<Process name='Process-1' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
<Process name='Process-2' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
Patch:
--- Objects/unicodeobject.c     2021-05-15 15:08:05.117433926 +0100
+++ Objects/unicodeobject.c.tmp 2021-05-15 23:48:35.236152366 +0100
@@ -16230,6 +16230,11 @@
     _PyUnicode_FiniEncodings(&tstate->interp->unicode.fs_codec);
 }
+static PyObject *
+interned_impl(PyObject *module)
+{
+    return PyDict_Values(interned);
+}
 /* A _string module, to export formatter_parser and formatter_field_name_split
    to the string.Formatter class implemented in Python. */
@@ -16239,6 +16244,8 @@
      METH_O, PyDoc_STR("split the argument as a field name")},
     {"formatter_parser", (PyCFunction) formatter_parser,
      METH_O, PyDoc_STR("parse the argument as a format string")},
+    {"interned", (PyCFunction) interned_impl,
+     METH_NOARGS, PyDoc_STR("lookup interned strings")},
     {NULL, NULL}
 };
You may also want to take a look at the multiprocessing.shared_memory module.
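For completeness, a rough sketch of what shared_memory usage looks like (note it was added in Python 3.8, so it wouldn't be available on the 3.6 setup in the question; the block contents here are purely illustrative):

from multiprocessing import shared_memory

# create a named block of shared memory and write some bytes into it
shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:5] = bytearray(b'hello')

# another process (or the same one, as here) can attach to it by name
other = shared_memory.SharedMemory(name=shm.name)
print(bytes(other.buf[:5]))   # b'hello'

other.close()
shm.close()
shm.unlink()   # release the block once nothing needs it any more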
Not an answer, but I thought it would be of interest to provide an MWE that does not require an input file. Peak memory usage is much higher when manual interning is turned off (the commented-out lines below), which HTF has, in my opinion, explained properly.
from multiprocessing import Pool
from random import choice
from string import ascii_lowercase
# from sys import intern

def rand_str(length):
    return ''.join([choice(ascii_lowercase) for i in range(length)])

def read_map():
    for value in global_map.values():
        x = value
        y = x + '_'
    return y

global_map = dict()
for i in range(20_000_000):
    # global_map[str(i)] = intern(rand_str(4))
    global_map[str(i)] = rand_str(4)

print("starting processes")

if __name__ == '__main__':
    with Pool(processes=2) as process_pool:
        processes = [process_pool.apply_async(read_map)
                     for process in range(process_pool._processes)]
        for process in processes:
            process.wait()
            print(process.get())