Python keeps running for 10+ minutes (after the last statement in the program) when there is a huge (33GB) data structure in memory (nothing in swap)

I need to parse a huge gz file (about ~10GB compressed, ~100GB uncompressed). The code builds a data structure ('data_struct') in memory. I am running on a machine with an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz with 16 CPUs and plenty of RAM (i.e. 200+ GB), running CentOS-6.9. I have implemented this using a class in Python 3.6.3 (CPython), as shown below:

import subprocess

class my_class():
    def __init__(self):
        cmd = 'gunzip -c huge-file.gz'   # -c streams the decompressed data to stdout
        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
        self.data_struct = dict()

    def populate_struct(self):
        for line in self.process.stdout:
            <populate the self.data_struct dictionary>

    def __del__(self):
        self.process.wait()
        #del self.data_struct  # presence/absence of this statement decreases/increases runtime respectively
#================End of my_class===================

def main():
    my_object = my_class()
    my_object.populate_struct()
    print(f'~~~~ Finished populate_struct() ~~~~')  # last statement in my program.
    ## Python keeps running at 100% past the previous statement for 10+mins

if __name__ == '__main__':
    main()
#================End of Main=======================

The resident memory consumption of my data_struct (RAM only, nothing in swap) is about ~33GB. I ran $ top to find the PID of the Python process and traced it with $ strace -p <PID> -o <out_file> to see what Python is doing. While populate_struct() is executing, I can see in the strace output file that Python creates data_struct with calls like mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b0684160000. After Python ran past the last print() statement, I found that it was issuing only munmap() operations, as shown below:

munmap(0x2b3c75375000, 41947136)        = 0
munmap(0x2b3c73374000, 33558528)        = 0
munmap(0x2b4015d2a000, 262144)          = 0
munmap(0x2b4015cea000, 262144)          = 0
munmap(0x2b4015caa000, 262144)          = 0
munmap(0x2b4015c6a000, 262144)          = 0
munmap(0x2b4015c2a000, 262144)          = 0
munmap(0x2b4015bea000, 262144)          = 0
munmap(0x2b4015baa000, 262144)          = 0
...
...

Python keeps running anywhere between 10 and 12 minutes after the last print() statement. One observation: if I have a del self.data_struct statement in the __del__() method, it takes only 2 minutes. I have run these experiments multiple times, and the runtime consistently decreases/increases with the presence/absence of del self.data_struct in __del__().
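
For what it's worth, here is a minimal, hypothetical sketch (not my actual parser; the dictionary size and contents are made up, and it needs a few GB of RAM) that just times how long tearing down a large dict of small objects takes when it is deleted explicitly rather than left for the end of the program:

import time

def main():
    # Stand-in for data_struct: tens of millions of small objects.
    data = {i: str(i) for i in range(20_000_000)}
    print(f'built {len(data)} entries')

    t0 = time.monotonic()
    del data  # comment this out to let the teardown happen when main() returns instead
    print(f'explicit deletion took {time.monotonic() - t0:.1f}s')

if __name__ == '__main__':
    main()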

My questions:

  1. My understanding is that Python is doing cleanup work via munmap(), but unlike Python, other languages like Perl release memory and exit the program immediately. Am I doing this right by implementing it as shown above? Is there a way to tell Python to avoid this munmap() work?
  2. Why does it take 10+ minutes to clean up when there is no del self.data_struct statement in __del__(), but only 2 minutes when there is?
  3. Is there a way to speed up the cleanup work, i.e. the munmap() calls?
  4. Is there a way to exit the program immediately, without the cleanup work?

Other thoughts/suggestions about tackling this problem are appreciated.

asked Aug 10 '20 by Shiva

1 Answer

Please try a more recent version of Python (at least 3.8)? This shows several signs of being a mild(!) form of a worst-case quadratic-time algorithm in CPython's object deallocator, which was rewritten here (and note that the issue linked to here in turn contains a link to an older StackOverflow post with more details):

https://bugs.python.org/issue37029

Some glosses

If my guess is right, the amount of memory isn't particularly important - it's instead the sheer number of distinct Python objects being managed by CPython's "small object allocator" (obmalloc.c), combined with "bad luck" in the order in which their memory is released.

When that code was first written, RAM wasn't big enough to hold millions of Python objects, so nobody noticed that one particular part of deallocation logic could take time quadratic in the number of allocated "arenas" (details aren't really helpful, but "arenas" are the granularity at which system mmap() and munmap() calls are made - 256 KiB chunks).
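
If you're curious how many arenas your own process has accumulated, CPython exposes an undocumented, CPython-specific helper, sys._debugmallocstats(), which dumps small-object allocator statistics (including arena counts) to stderr. A tiny sketch, with a throwaway dict just to force lots of small allocations:

import sys

# Allocate a pile of small objects, then dump obmalloc statistics to stderr.
# Look for the arena lines (e.g. "# arenas allocated current") in the output.
data = {i: (i, str(i)) for i in range(1_000_000)}
sys._debugmallocstats()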

It's not those mapping calls that are consuming mounds of time, and any decent implementation of any language using OS memory mapping facilities will eventually call munmap() enough times to release the OS resources consumed by its mmap() calls.

So that's a red herring. munmap() is being called many times simply because you allocated many objects, which required many mmap() calls.

There isn't any crisp or easy way to explain exactly when the problem shows up. See "bad luck" above ;-) The relevant code was rewritten for CPython 3.8 to be worst-case linear time instead, which gave a factor of ~250 speedup for the specific program that triggered the issue report (see the link already given).

As a comment noted, you can exit your program immediately at any time by invoking os._exit(), but the leading underscore is meant to scare you off: "immediately" means "immediately". No cleanups of any kind are performed. For example, the __del__ method in your class? Skipped. __del__ is run as a side effect of deallocation, but if you actually "immediately release memory and exit the program" then no destructors of any kind are run, nor any handlers registered with the atexit module, etc etc. It's as drastic as a program dying, e.g., with a segfault.
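
For example, a minimal sketch (with a throwaway dict standing in for your data_struct) of bailing out that way:

import os
import sys

def main():
    data_struct = {i: str(i) for i in range(10_000_000)}  # stand-in for the real structure
    print('~~~~ Finished populate_struct() ~~~~')
    sys.stdout.flush()  # os._exit() skips buffer flushing, so flush manually
    os._exit(0)         # exits immediately; the slow deallocation pass never runs

if __name__ == '__main__':
    main()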

answered Oct 18 '22 by Tim Peters