I have a Python 3.6 data processing task that involves pre-loading a large dict for looking up dates by ID for use in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most if not all of the memory on the box, so one optimisation I applied was to 'intern' the string dates being stored in the dict. This reduced the memory footprint of the dict by several GBs as I expected it would, but it also had another unexpected effect.
Before applying interning, the sub-processes would gradually eat more and more memory as they executed, which I believe was down to them having to copy the dict gradually from global memory across to the sub-processes' individual allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.
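As a small illustration of what I mean by reference counting writing to memory: even a purely 'read-only' operation such as binding a dict value to a new name increments the object's reference count, and that count lives in the object header, i.e. on the same memory page as the object itself (the string value below is just an example of the kind of date stored in my dict):

import sys

value = "2021-05-15"              # example of the kind of string stored in the dict
print(sys.getrefcount(value))     # some baseline count

x = value                         # a 'read-only' lookup of the value...
print(sys.getrefcount(value))     # ...still bumps the count by one, which is a write
                                  # to the page that holds the object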
I was only expecting the interning to reduce the memory footprint of the dict, but in fact it also stopped the memory usage from gradually increasing over the sub-processes' lifetimes.
Here's a minimal example I was able to build that replicates the behaviour, although it requires a large file to load and populate the dict from, and enough repetition in the values for interning to provide a benefit.
import multiprocessing
import sys

# initialise a large dict that will be visible to all processes
# that contains a lot of repeated values
global_map = dict()

with open(sys.argv[1], 'r', encoding='utf-8') as file:
    if len(sys.argv) > 2:
        print('interning is on')
    else:
        print('interning is off')
    for i, line in enumerate(file):
        if i > 30000000:
            break
        parts = line.split('|')
        if len(sys.argv) > 2:
            global_map[str(i)] = sys.intern(parts[2])
        else:
            global_map[str(i)] = parts[2]

def read_map():
    # do some nonsense processing with each value in the dict
    global global_map
    for i in range(30000000):
        x = global_map[str(i)]
        y = x + '_'
    return y

print("starting processes")
process_pool = multiprocessing.Pool(processes=10)
for _ in range(10):
    process_pool.apply_async(read_map)
process_pool.close()
process_pool.join()
I ran this script and monitored htop to see the total memory usage.

| interning? | mem usage just after 'starting processes' printed | peak mem usage after that |
|---|---|---|
| no | 7.1GB | 28.0GB |
| yes | 5.5GB | 5.6GB |
While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why this works. If the creeping memory usage by the sub-processes is down to copy-on-write, why doesn't this happen if I intern the strings?
Shared memory: the multiprocessing module provides Array and Value objects to share data between processes. Array is a ctypes array allocated from shared memory, and Value is a ctypes object allocated from shared memory.

Shared memory can be a very efficient way of handling data in a program that uses concurrency, and Python's mmap module likewise uses memory mapping to share large amounts of data efficiently between multiple Python processes, threads and tasks that are running concurrently.
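As a minimal sketch of how Value and Array are used (this follows the pattern from the multiprocessing documentation; the worker function and the 'd'/'i' typecodes are just illustrative choices):

from multiprocessing import Process, Value, Array

def worker(num, arr):
    # both objects live in shared memory, so these writes are
    # visible to the parent process without any copying or pickling
    num.value = 3.1415927
    for i in range(len(arr)):
        arr[i] = -arr[i]

if __name__ == '__main__':
    num = Value('d', 0.0)          # a shared double
    arr = Array('i', range(10))    # a shared array of ints
    p = Process(target=worker, args=(num, arr))
    p.start()
    p.join()
    print(num.value)               # 3.1415927
    print(arr[:])                  # [0, -1, -2, ..., -9]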
The CPython implementation stores interned strings in a global object that is a regular Python dictionary, in which both keys and values are pointers to the string objects.
When a new child process is created, it gets a copy of the parent's address space, so it will use the reduced data dictionary with interned strings.
I've compiled Python with the patch below and as you can see, both processes have access to the table with interned strings:
test.py:
import multiprocessing as mp
import sys
import _string

PROCS = 2
STRING = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

def worker():
    proc = mp.current_process()
    interned = _string.interned()
    try:
        idx = interned.index(STRING)
    except ValueError:
        s = None
    else:
        s = interned[idx]
    print(f"{proc}: <{s}>")

def main():
    sys.intern(STRING)
    procs = []
    for _ in range(PROCS):
        p = mp.Process(target=worker)
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
Test:
# python test.py
<Process name='Process-1' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
<Process name='Process-2' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
Patch:
--- Objects/unicodeobject.c     2021-05-15 15:08:05.117433926 +0100
+++ Objects/unicodeobject.c.tmp 2021-05-15 23:48:35.236152366 +0100
@@ -16230,6 +16230,11 @@
     _PyUnicode_FiniEncodings(&tstate->interp->unicode.fs_codec);
 }
+static PyObject *
+interned_impl(PyObject *module)
+{
+    return PyDict_Values(interned);
+}
 /* A _string module, to export formatter_parser and formatter_field_name_split
    to the string.Formatter class implemented in Python. */
@@ -16239,6 +16244,8 @@
      METH_O, PyDoc_STR("split the argument as a field name")},
     {"formatter_parser", (PyCFunction) formatter_parser,
      METH_O, PyDoc_STR("parse the argument as a format string")},
+    {"interned", (PyCFunction) interned_impl,
+     METH_NOARGS, PyDoc_STR("lookup interned strings")},
     {NULL, NULL}
 };
You may also want to take a look at the multiprocessing.shared_memory module.
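For completeness, a rough sketch of what shared_memory usage looks like (note it was added in Python 3.8, so it wouldn't be available on the 3.6 setup in the question; the block contents here are purely illustrative):

from multiprocessing import shared_memory

# create a named block of shared memory and write some bytes into it
shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:5] = bytearray(b'hello')

# another process (or the same one, as here) can attach to it by name
other = shared_memory.SharedMemory(name=shm.name)
print(bytes(other.buf[:5]))   # b'hello'

other.close()
shm.close()
shm.unlink()   # release the block once nothing needs it any more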
Not an answer, but I thought it would be of interest to provide an MWE that does not require an input file. Peak memory usage is much higher when manual interning is turned off (the commented-out lines below), which HTF has, in my opinion, explained properly.
from multiprocessing import Pool
from random import choice
from string import ascii_lowercase
# from sys import intern

def rand_str(length):
    return ''.join([choice(ascii_lowercase) for i in range(length)])

def read_map():
    for value in global_map.values():
        x = value
        y = x + '_'
    return y

global_map = dict()
for i in range(20_000_000):
    # global_map[str(i)] = intern(rand_str(4))
    global_map[str(i)] = rand_str(4)

print("starting processes")

if __name__ == '__main__':
    with Pool(processes=2) as process_pool:
        processes = [process_pool.apply_async(read_map)
                     for process in range(process_pool._processes)]
        for process in processes:
            process.wait()
            print(process.get())