Understanding memory usage in Python

Tags:

memory

I'm trying to understand how Python uses memory so that I can estimate how many processes I can run at a time. Right now I process large files on a server with a large amount of RAM (~90-150GB free).

As a test, I ran each step in Python and then looked at htop to see what the usage was.

step 1: I open a file which is 2.55GB and save it to a string

with open(file,'r') as f:
    data=f.read()

Usage is 2686M

step 2: I split the file on newlines

data = data.split('\n')

usage is 7476M

step 3: I keep only every 4th line (two of the three lines I remove are of equal length to the line I keep)

data=[data[x] for x in range(0,len(data)) if x%4==1]

usage is 8543M

step 4: I split this into 20 equal chunks to run through a multiprocessing pool.

chunk = len(data) // 40  # integer division, so range() also accepts it on Python 3
l = []
for b in range(0, len(data), chunk):
    l.append(data[b:b + chunk])

usage is 8621M

step 5: I delete data, usage is 8496M.

There are several things that are not making sense to me.

In step two, why does the memory usage go up so much when I change the string into a list? I am assuming that the list containers are much larger than the string container?

In step three, why doesn't the data shrink significantly? I essentially got rid of 3/4 of my list entries and at least 2/3 of the data within the list. I would expect it to shrink accordingly. Calling the garbage collector did not make any difference.

Oddly enough, when I assigned the smaller list to another variable, it used less memory: usage 6605M.

When I delete the old object data: usage 6059M.

This seems weird to me. Any help on shrinking my memory footprint would be appreciated.

EDIT

Okay, this is making my head hurt. Clearly Python is doing some weird things behind the scenes here... and only Python. I've made the following script to demonstrate this using my original method and the method suggested in the answer below. Numbers are all in GB.

TEST CODE

import os,sys
import psutil
process = psutil.Process(os.getpid())
import time

py_usage=process.memory_info().vms / 1000000000.0
in_file = "14982X16.fastq"

def totalsize(o):
    size = 0
    for x in o:
        size += sys.getsizeof(x)
    size += sys.getsizeof(o)
    return "Object size:"+str(size/1000000000.0)

def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()

def method1():
    start=time.time()
    with open(in_file,'rb') as f:
        data = f.read().split("\n")
    data=[data[x] for x in xrange(0,len(data)) if x%4==1]
    return data

def method2():
    start=time.time()
    with open(in_file,'rb') as f:
        data2=list(getlines4(f))
    return data2


print "method1 == method2",method1()==method2()
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data=method1()
print "data from method1 is in memory"
print "method1", totalsize(data)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "data from method2 is in memory"
print "method2", totalsize(data2)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage


print "\nPrepare to have your mind blown even more!"
data=method1()
print "Data from method1 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "Data from method1 and method 2 are in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data==data2
print "Compared the two lists"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Data from method2 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage

OUTPUT

method1 == method2 True
Nothing in memory
Usage: 0.001798144
data from method1 is in memory
method1 Object size:1.52604683
Usage: 4.552925184
Nothing in memory
Usage: 0.001798144
data from method2 is in memory
method2 Object size:1.534815518
Usage: 1.56932096
Nothing is in memory
Usage: 0.001798144

Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 4.552925184
Data from method1 and method 2 are in memory
Usage: 4.692287488
Compared the two lists
Usage: 4.692287488
Data from method2 is in memory
Usage: 4.56169472
Nothing is in memory
Usage: 0.001798144

For those of you using Python 3, it's pretty similar, except not as bad after the comparison operation...

OUTPUT FROM PYTHON3

method1 == method2 True
Nothing in memory
Usage: 0.004395008000000006
data from method1 is in memory
method1 Object size:1.718523294
Usage: 5.322555392
Nothing in memory
Usage: 0.004395008000000006
data from method2 is in memory
method2 Object size:1.727291982
Usage: 1.872596992
Nothing is in memory
Usage: 0.004395008000000006

Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 5.322555392
Data from method1 and method 2 are in memory
Usage: 5.461917696
Compared the two lists
Usage: 5.461917696
Data from method2 is in memory
Usage: 2.747633664
Nothing is in memory
Usage: 0.004395008000000006

Moral of the story... memory for Python appears to be a bit like Camelot for Monty Python... 'tis a very silly place.

775

asked May 01 '18 22:05

jeffpkamp

1 Answer

I'm going to suggest that you back off and approach this instead in a way that directly addresses your goal: shrinking peak memory use to begin with. No amount of analysis & fiddling later can overcome using a doomed approach to begin with ;-)

Concretely, you got off on a wrong foot at the first step, via data=f.read(). Now it's already the case that your program can't possibly scale beyond a data file that fits entirely in RAM with room to spare (to run the OS and Python and ...) too.

Do you actually need all the data to be in RAM at one time? There are too few details to tell about later steps, but obviously not at the start, since you immediately want to throw away 75% of the lines you read.

So start off by doing that incrementally instead:

def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line

Even if you do nothing other than just that much, you can skip directly to the result of step 3, saving an enormous amount of peak RAM use:

with open(file, 'r') as f:
    data = list(getlines4(f))

Now peak RAM need is proportional to the number of bytes in the only lines you care about, instead of to the total number of file bytes period.
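As a rough illustration of that difference, the sketch below compares the peak Python-level allocations of the two approaches on a small synthetic file. The file contents, the record count and the helper names (read_all_then_filter, stream_filter, peak_bytes, demo) are assumptions made up for this example. tracemalloc only sees memory allocated through Python's allocators, so the numbers will not match what htop or the OS reports; they only show the relative gap.

```python
import os
import tempfile
import tracemalloc


def getlines4(f):
    # Same generator as above: keep the 2nd line of every group of four.
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line


def read_all_then_filter(path):
    # The original approach: whole file in RAM, split, then filter.
    with open(path, "r") as f:
        lines = f.read().split("\n")
    return [lines[i] for i in range(len(lines)) if i % 4 == 1]


def stream_filter(path):
    # The incremental approach: only the kept lines are ever stored.
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in getlines4(f)]


def peak_bytes(func, path):
    # Peak size of Python-level allocations while func(path) runs.
    tracemalloc.start()
    try:
        result = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def demo(records=5000):
    # Synthetic 4-line records (an assumption; the real data is unknown).
    record = "@read\n" + "ACGT" * 25 + "\n+\n" + "I" * 100 + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
        tmp.write(record * records)
        path = tmp.name
    try:
        all_result, all_peak = peak_bytes(read_all_then_filter, path)
        stream_result, stream_peak = peak_bytes(stream_filter, path)
    finally:
        os.remove(path)
    return all_result == stream_result, all_peak, stream_peak
```

Calling demo() returns whether both approaches produced the same list, followed by the two peak figures in bytes. On real data the ratio depends on line lengths and on how many lines are kept.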

To continue making progress, instead of materializing all the lines of interest in data in one giant gulp, feed the lines (or chunks of lines) incrementally to your worker processes too. There wasn't enough detail for me to suggest concrete code for that, but keep the goal in mind and you'll figure it out: you only need enough RAM to keep incrementally feeding lines to worker processes, and to save away however much of the worker processes' results you need to keep in RAM. It's possible that peak memory use doesn't need to be more than "tiny", regardless of input file size.

Fighting memory management details instead is enormously harder than taking a memory-friendly approach to begin with. Python itself has several memory-management subsystems, and a great deal can be said about each of them. They in turn rely on the platform C malloc/free facilities, about which there's also a great deal to learn. And we're still not at a level that has anything directly to do with what your operating system reports for "memory use". The platform C libraries in turn rely on platform-specific OS memory management primitives, which - typically - only OS kernel memory experts truly understand.
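One possible shape for that, as a minimal sketch only: the worker count_bases is a hypothetical stand-in for whatever the real per-line work is, and iter_chunks, process_file and the chunk size are likewise assumptions invented for illustration. It uses multiprocessing.Pool.imap_unordered so that chunks are produced from the file as the pool consumes them, rather than being built into one big list first. The pool may still read somewhat ahead of the workers, so this bounds memory loosely rather than strictly.

```python
from itertools import islice
from multiprocessing import Pool


def getlines4(f):
    # Same generator as above: keep the 2nd line of every group of four.
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line


def iter_chunks(lines, chunk_size):
    # Yield lists of at most chunk_size items from any iterable,
    # without materializing the whole iterable.
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_bases(chunk):
    # Hypothetical worker: replace with the real per-chunk processing.
    return sum(len(line.rstrip("\n")) for line in chunk)


def process_file(path, processes=4, chunk_size=10000):
    # Only small per-chunk results are kept in the parent process.
    total = 0
    with open(path, "r") as f, Pool(processes) as pool:
        chunks = iter_chunks(getlines4(f), chunk_size)
        for partial in pool.imap_unordered(count_bases, chunks):
            total += partial
    return total
```

On platforms that start worker processes by spawning a fresh interpreter, the call to process_file needs to sit under an `if __name__ == "__main__":` guard. Larger chunks reduce inter-process overhead at the cost of more RAM per in-flight chunk.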

The answer to "why does the OS say I'm still using N GiB of RAM?" can rely on application-specific details in any one of those layers, or even on unfortunate more-or-less accidental interactions among them. Far better to arrange not to need to ask such questions to begin with.

EDIT - about CPython's obmalloc

It's great that you gave some runnable code, but not so great that nobody but you can run it since nobody else has your data ;-) Things like "how many lines are there?" and "what's the distribution of line lengths?" can be critical, but we have no way to guess.

As I noted before, application-specific details are often necessary to out-think modern memory managers. They're complex, and behavior at all the levels can be subtle.

Python's primary object allocator ("obmalloc") requests "arenas" from the platform C malloc, chunks of 2**18 bytes. So long as that's the Python memory system your application is using (which can't be guessed at because we don't have your data to work with), 256 KiB is the smallest granularity at which memory is requested from, or returned to, the C level. The C level in turn typically has "chunk things up" strategies of its own, which vary across C implementations.

A Python arena is in turn carved into 4 KiB "pools", each of which dynamically adapts to be carved into smaller chunks of a fixed size per pool (8-byte chunks, 16-byte chunks, 24-byte chunks, ..., 8*i-byte chunks per pool).
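The arithmetic behind those numbers can be written out as below. The constants are the ones quoted above (256 KiB arenas, 4 KiB pools, chunk sizes in multiples of 8) and differ in other CPython releases. The function names are made up for this sketch, and the per-pool bookkeeping header is ignored, so blocks_per_pool is an upper bound rather than the exact figure.

```python
ARENA_BYTES = 2 ** 18    # 256 KiB requested from the C malloc per arena
POOL_BYTES = 4 * 1024    # 4 KiB per pool
ALIGNMENT = 8            # chunk sizes are multiples of 8 bytes


def pools_per_arena():
    # How many 4 KiB pools fit in one arena.
    return ARENA_BYTES // POOL_BYTES


def size_class(nbytes):
    # Round a request up to the next multiple of ALIGNMENT.
    return (nbytes + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT


def blocks_per_pool(nbytes):
    # Upper bound on chunks of this size class in one pool
    # (ignores the pool's own header, which takes a little space).
    return POOL_BYTES // size_class(nbytes)
```

So one arena holds 64 pools, a 25-byte request is served from a pool of 32-byte chunks, and such a pool holds at most 128 of them.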

So long as a single byte in an arena is being used for live data, the entire arena must be retained. If that means the other 262,143 arena bytes sit unused, tough luck. As your output shows, all the memory is returned in the end, so why do you really care? I understand it's an abstractly interesting puzzle, but you're not going to solve it short of making major efforts to understand the code in CPython's obmalloc.c. For a start. Any "summary" would leave out a detail that's actually important to some application's microscopic behavior.
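For anyone who does want to look, CPython exposes a diagnostic hook, sys._debugmallocstats(), that prints the allocator's current arena, pool and size-class statistics to stderr. It is a CPython implementation detail, so the sketch below checks for it before calling it; the wrapper name dump_obmalloc_stats is made up for this example, and the output format varies across releases.

```python
import sys


def dump_obmalloc_stats():
    # Print CPython's small-object allocator statistics to stderr.
    # Returns False on interpreters that do not provide the hook.
    stats = getattr(sys, "_debugmallocstats", None)
    if stats is None:
        return False
    stats()
    return True
```

Calling it before and after a `del` of a large list gives a view of how many arenas are still allocated versus how much of their space is actually in use.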

Plausible: your strings are short enough that space for all the string object headers and contents (the actual string data) are obtained from CPython's obmalloc. They're going to be splattered all over multiple arenas. An arena might look like this, where "H" represents pools from which string object headers are allocated, and "D" pools from which space for string data is allocated:

HHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDD...

In your method1 they'll tend to alternate "like that" because creating a single string object requires allocating space separately for the string object header and the string object data. When you go on to throw out 3/4ths of the strings you created, more-or-less 3/4ths of that space becomes reusable to Python. But not one byte can be returned to the system C because there's still live data sprayed all over the arena, containing the quarter of the string objects you didn't throw away (here "-" means space available for reuse):

HHDD------------HHDD------------HHDD------------HHDD----...
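That picture can be mimicked with a deliberately simplified model, which is not how obmalloc really works: every object takes exactly one slot, slots are filled strictly in allocation order, and an arena can be released only when none of its slots is live. The function name simulate and the slot count per arena are assumptions chosen for illustration.

```python
def simulate(n_objects, slots_per_arena=1024):
    # Arena index each object lands in under sequential allocation.
    arena_of = [i // slots_per_arena for i in range(n_objects)]
    total_arenas = len(set(arena_of))

    # Keep only every 4th object (index % 4 == 1), as in the question.
    live_arenas = [a for i, a in enumerate(arena_of) if i % 4 == 1]

    # Arenas that still hold at least one live object cannot be freed.
    pinned_arenas = len(set(live_arenas))
    live_fraction = len(live_arenas) / n_objects
    return total_arenas, pinned_arenas, live_fraction
```

With 100,000 objects, simulate(100000) reports 98 arenas in total, all 98 still pinned, and only 25% of the objects live: three quarters of the space is reusable by Python, yet no arena can be handed back.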

There's so much free space that, in fact, it's possible that the less wasteful method2 can get all the memory it needs from the -------- holes left over from method1 even when you don't throw away the method1 result.

Just to keep things simple ;-) , I'll note that some of those details about how CPython's obmalloc gets used vary across Python releases too. In general, the more recent the Python release, the more it tries to use obmalloc first instead of the platform C malloc/free (because obmalloc is generally faster).

But even if you use the platform C malloc/free directly, you can still see the same kinds of things happening. Kernel memory system calls are typically more expensive than running code purely in user space, so platform C malloc/free routines typically have their own strategies for "ask the kernel for much more memory than we need for a single request, and carve it up into smaller pieces ourself".

Something to note: neither Python's obmalloc nor platform C malloc/free implementations ever move live data on their own. Both return memory addresses to clients, and those cannot change. "Holes" are an inescapable fact of life under both.

110

answered Sep 19 '22 01:09

Tim Peters
