Why does interning global string values result in less memory used per multiprocessing process?

I have a Python 3.6 data processing task that involves pre-loading a large dict for looking up dates by ID for use in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most if not all of the memory on the box, so one optimisation I applied was to 'intern' the string dates being stored in the dict. This reduced the memory footprint of the dict by several GBs as I expected it would, but it also had another unexpected effect.

Before applying interning, the sub-processes would gradually eat more and more memory as they executed, which I believe was down to them having to copy the dict gradually from global memory across to the sub-processes' individual allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.
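A minimal illustration of that effect (just a sketch, not part of the original task): even a 'read-only' use of an object writes to its reference count, which lives in the object's header, so the page holding the object gets dirtied and copied.

import sys

s = '2021-05-10'           # stands in for one of the date values in the dict
print(sys.getrefcount(s))  # baseline count (getrefcount itself adds one temporarily)

x = s                      # no mutation of the string, just another reference...
print(sys.getrefcount(s))  # ...yet the count stored in the object's header went up by one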

I was only expecting the interning to reduce the memory footprint of the dict, but in fact it also stopped the memory usage gradually increasing over the sub-processes' lifetimes.

Here's a minimal example I was able to build that replicates the behaviour, although it requires a large file to load in to populate the dict, and enough repetition in the values to make sure that interning provides a benefit.

import multiprocessing
import sys

# initialise a large dict that will be visible to all processes
# that contains a lot of repeated values
global_map = dict()
with open(sys.argv[1], 'r', encoding='utf-8') as file:
  if len(sys.argv) > 2:
    print('interning is on')
  else:
    print('interning is off')
  for i, line in enumerate(file):
    if i > 30000000:
      break
    parts = line.split('|')
    if len(sys.argv) > 2:
      global_map[str(i)] = sys.intern(parts[2])
    else:
      global_map[str(i)] = parts[2]

def read_map():
  # do some nonsense processing with each value in the dict
  global global_map
  for i in range(30000000):
    x = global_map[str(i)]
  y = x + '_'
  return y

print("starting processes")
process_pool = multiprocessing.Pool(processes=10)

for _ in range(10):
  process_pool.apply_async(read_map)

process_pool.close()

process_pool.join()

I ran this script and monitored htop to see the total memory usage.

interning?   mem usage just after 'starting processes' printed   peak mem usage after that
no           7.1GB                                               28.0GB
yes          5.5GB                                               5.6GB
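(For anyone who wants to capture those numbers programmatically rather than by eyeballing htop, here is a small Linux-only sketch that reads a process's resident set size from /proc; the helper name is made up for illustration.)

import os

def rss_mb(pid=None):
    # resident set size of a process, in MB, read from /proc (Linux only)
    pid = pid or os.getpid()
    with open(f'/proc/{pid}/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024  # /proc reports the value in kB
    return None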

While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why this works. If the creeping memory usage by the sub-processes is down to copy-on-write, why doesn't this happen if I intern the strings?

asked May 10 '21 by Rob Streeting



2 Answers

The CPython implementation stores interned strings in a global object that is a regular Python dictionary, where both keys and values are pointers to the string objects.

When a new child process is created, it gets a copy of the parent's address space, so it will use the reduced data dictionary with the interned strings.
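The effect on the data dictionary itself is easy to see (a small sketch, not from the question's code): equal strings built at runtime are normally distinct objects, whereas sys.intern maps them onto a single shared object, so the dict ends up storing millions of pointers to a few unique string objects instead of millions of duplicate copies.

import sys

a = ''.join(['2021-05', '-10'])   # two equal date strings built at runtime
b = ''.join(['2021-05', '-10'])   # are two distinct objects in memory
print(a == b, a is b)             # True False

a = sys.intern(a)
b = sys.intern(b)
print(a == b, a is b)             # True True: both names now refer to one shared object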

I've compiled Python with the patch below and as you can see, both processes have access to the table with interned strings:

test.py:

import multiprocessing as mp
import sys
import _string


PROCS = 2
STRING = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"


def worker():
    proc = mp.current_process()
    interned = _string.interned()

    try:
        idx = interned.index(STRING)
    except ValueError:
        s = None
    else:
        s = interned[idx]

    print(f"{proc}: <{s}>")


def main():
    sys.intern(STRING)

    procs = []

    for _ in range(PROCS):
        p = mp.Process(target=worker)
        p.start()
        procs.append(p)

    for p in procs:
        p.join()


if __name__ == "__main__":
    main()

Test:

# python test.py 
<Process name='Process-1' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
<Process name='Process-2' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>

Patch:

--- Objects/unicodeobject.c 2021-05-15 15:08:05.117433926 +0100
+++ Objects/unicodeobject.c.tmp 2021-05-15 23:48:35.236152366 +0100
@@ -16230,6 +16230,11 @@
     _PyUnicode_FiniEncodings(&tstate->interp->unicode.fs_codec);
 }
 
+static PyObject *
+interned_impl(PyObject *module)
+{
+    return PyDict_Values(interned);
+}
 
 /* A _string module, to export formatter_parser and formatter_field_name_split
    to the string.Formatter class implemented in Python. */
@@ -16239,6 +16244,8 @@
      METH_O, PyDoc_STR("split the argument as a field name")},
     {"formatter_parser", (PyCFunction) formatter_parser,
      METH_O, PyDoc_STR("parse the argument as a format string")},
+    {"interned", (PyCFunction) interned_impl,
+     METH_NOARGS, PyDoc_STR("lookup interned strings")},
     {NULL, NULL}
 };

You may also want to take a look at the shared_memory module.
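For completeness, a tiny sketch of that module (available from Python 3.8, so it would not apply to the 3.6 setup in the question): one process creates a named block, and any other process can attach to it by name without the data being copied.

from multiprocessing import shared_memory

# create a named block of shared memory and write into it
shm = shared_memory.SharedMemory(create=True, size=32)
shm.buf[:5] = b'hello'

# attach a second handle by name (in practice this would happen in another process)
other = shared_memory.SharedMemory(name=shm.name)
print(bytes(other.buf[:5]))   # b'hello'

other.close()
shm.close()
shm.unlink()                  # release the block once every process is done with it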

References:

  • The internals of Python string interning

answered Sep 28 '22 by HTF


Not an answer, but I thought it would be of interest to provide an MWE that does not require an input file. Peak memory usage is much higher when manual interning is turned off, which HTF has, in my opinion, explained correctly.

from multiprocessing import Pool
from random import choice
from string import ascii_lowercase
# from sys import intern


def rand_str(length):
    # random lowercase string; a short length guarantees plenty of repeated values
    return ''.join([choice(ascii_lowercase) for i in range(length)])


def read_map():
    # touch every value in the inherited dict, as in the question's example
    for value in global_map.values():
        x = value
    y = x + '_'
    return y


# built at module level so the forked workers inherit it via copy-on-write
global_map = dict()
for i in range(20_000_000):
    # swap the next two lines (and uncomment the import) to turn interning on
    # global_map[str(i)] = intern(rand_str(4))
    global_map[str(i)] = rand_str(4)
print("starting processes")
if __name__ == '__main__':
    with Pool(processes=2) as process_pool:
        processes = [process_pool.apply_async(read_map)
                     for process in range(process_pool._processes)]
        for process in processes:
            process.wait()
            print(process.get())
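To put numbers on "much higher", one option (just a sketch, not part of the original MWE) is to have each worker report its own peak resident set size via the resource module, for example by replacing read_map above with a variant like this:

import resource

def read_map():
    for value in global_map.values():
        x = value
    # ru_maxrss is this worker's peak resident set size (kilobytes on Linux)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return x + '_', peak_kb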

answered Sep 28 '22 by Patol75