I'm trying to understand how Python uses memory so that I can estimate how many processes I can run at a time. Right now I process large files on a server with a large amount of RAM (~90-150 GB free).
For a test, I would do things in Python, then look at htop to see what the usage was.
Step 1: I open a 2.55 GB file and read it into a string:

with open(file, 'r') as f:
    data = f.read()
Usage is 2686M
Step 2: I split the data on newlines:

data = data.split('\n')

Usage is 7476M
Step 3: I keep only every 4th line (two of the three lines I drop are about the same length as the line I keep):

data = [data[x] for x in range(0, len(data)) if x % 4 == 1]

Usage is 8543M
Step 4: I split this into 40 equal chunks to run through a multiprocessing pool:

l = []
chunk = len(data) // 40
for b in range(0, len(data), chunk):
    l.append(data[b:b + chunk])

Usage is 8621M
Step 5: I delete data. Usage is 8496M.
There are several things that are not making sense to me.
In step 2, why does the memory usage go up so much when I turn the string into a list of strings? I am assuming that the list containers are much larger than the string container?
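A quick way to check that hunch in isolation: each line becomes its own string object with its own header, on top of the list's one pointer per element. A small sketch (exact numbers vary by Python version and platform):

import sys

text = ("x" * 50 + "\n") * 1000000   # ~51 MB as one string
lines = text.split("\n")             # one string object per line

string_size = sys.getsizeof(text)
# shallow size of the list (one 8-byte pointer per element)
# plus the per-object header of every line string
list_size = sys.getsizeof(lines) + sum(sys.getsizeof(s) for s in lines)

print("one big string: %d bytes" % string_size)
print("list of lines:  %d bytes" % list_size)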
In step 3, why doesn't the data shrink significantly? I essentially got rid of 3/4 of my list entries and at least 2/3 of the data within the list, so I would expect the usage to shrink accordingly. Calling the garbage collector made no difference.
Oddly enough, when I assigned the smaller list to another variable, it used less memory: usage is 6605M. When I then deleted the old object data, usage dropped to 6059M.
This seems weird to me. Any help on shrinking my memory footprint would be appreciated.
EDIT
Okay, this is making my head hurt. Clearly Python is doing some weird things behind the scenes here... and only Python. I've made the following script to demonstrate this using my original method and the method suggested in the answer below. Numbers are all in GB.
TEST CODE
import os, sys
import psutil

process = psutil.Process(os.getpid())
py_usage = process.memory_info().vms / 1000000000.0
in_file = "14982X16.fastq"

def totalsize(o):
    # shallow size of the container plus the shallow size of each element
    size = sys.getsizeof(o)
    for x in o:
        size += sys.getsizeof(x)
    return "Object size:" + str(size / 1000000000.0)

def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()

def method1():
    with open(in_file, 'rb') as f:
        data = f.read().split("\n")
    data = [data[x] for x in xrange(0, len(data)) if x % 4 == 1]
    return data

def method2():
    with open(in_file, 'rb') as f:
        return list(getlines4(f))
print "method1 == method2",method1()==method2()
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data=method1()
print "data from method1 is in memory"
print "method1", totalsize(data)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "data from method2 is in memory"
print "method2", totalsize(data2)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
print "\nPrepare to have your mind blown even more!"
data=method1()
print "Data from method1 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "Data from method1 and method 2 are in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data==data2
print "Compared the two lists"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Data from method2 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
OUTPUT
method1 == method2 True
Nothing in memory
Usage: 0.001798144
data from method1 is in memory
method1 Object size:1.52604683
Usage: 4.552925184
Nothing in memory
Usage: 0.001798144
data from method2 is in memory
method2 Object size:1.534815518
Usage: 1.56932096
Nothing is in memory
Usage: 0.001798144
Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 4.552925184
Data from method1 and method 2 are in memory
Usage: 4.692287488
Compared the two lists
Usage: 4.692287488
Data from method2 is in memory
Usage: 4.56169472
Nothing is in memory
Usage: 0.001798144
For those of you using Python 3, it's pretty similar, except not as bad after the comparison operation...
OUTPUT FROM PYTHON3
method1 == method2 True
Nothing in memory
Usage: 0.004395008000000006
data from method1 is in memory
method1 Object size:1.718523294
Usage: 5.322555392
Nothing in memory
Usage: 0.004395008000000006
data from method2 is in memory
method2 Object size:1.727291982
Usage: 1.872596992
Nothing is in memory
Usage: 0.004395008000000006
Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 5.322555392
Data from method1 and method 2 are in memory
Usage: 5.461917696
Compared the two lists
Usage: 5.461917696
Data from method2 is in memory
Usage: 2.747633664
Nothing is in memory
Usage: 0.004395008000000006
Moral of the story... memory in Python appears to be a bit like Camelot in Monty Python... 'tis a very silly place.
You can use memory_profiler by putting the @profile decorator on any function or method and running python -m memory_profiler myscript.py. You'll see line-by-line memory usage once your script exits.
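For example, a minimal instrumented script might look like this (assuming memory_profiler is installed via pip; the @profile name is injected by the module at run time, so no import is needed):

# save as profile_demo.py, then run:
#   python -m memory_profiler profile_demo.py

@profile  # injected by memory_profiler when run via -m memory_profiler
def load_every_fourth(path):
    with open(path) as f:
        return [line.rstrip() for i, line in enumerate(f) if i % 4 == 1]

if __name__ == "__main__":
    load_every_fourth("14982X16.fastq")  # file name taken from the question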
In Python, sys.getsizeof() can be used to find the storage size of a particular object in memory. The function returns the size of the object in bytes, but it is shallow: it does not include the sizes of the objects a container merely references.
When you create a list object, the list object by itself takes around 56-64 bytes of memory on 64-bit CPython (the exact figure varies by version), and each item adds 8 bytes to the size of the list, because the list stores only references (pointers) to its element objects.
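A quick illustration of both points, and of the fact that getsizeof is shallow:

import sys

print(sys.getsizeof([]))           # the bare list object (~56-64 bytes)
print(sys.getsizeof([None] * 10))  # grows by 8 bytes per element slot

lines = ["some line", "another line"]
shallow = sys.getsizeof(lines)                         # the list only
deep = shallow + sum(sys.getsizeof(s) for s in lines)  # plus the strings
print("shallow: %d, with strings: %d" % (shallow, deep))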
I'm going to suggest that you back off and approach this instead in a way that directly addresses your goal: shrinking peak memory use to begin with. No amount of analysis & fiddling later can overcome using a doomed approach to begin with ;-)
Concretely, you got off on the wrong foot at the very first step, via data = f.read(). Now it's already the case that your program can't possibly scale beyond a data file that fits entirely in RAM with room to spare (for the OS, Python, and so on).
Do you actually need all the data to be in RAM at one time? There are too few details to tell about later steps, but obviously not at the start, since you immediately want to throw away 75% of the lines you read.
So start off by doing that incrementally instead:
def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line
Even if you do nothing other than just that much, you can skip directly to the result of step 3, saving an enormous amount of peak RAM use:
with open(file, 'r') as f:
    data = list(getlines4(f))
Now peak RAM need is proportional to the number of bytes in the only lines you care about, instead of to the total number of file bytes period.
To continue making progress, instead of materializing all the lines of interest in data in one giant gulp, feed the lines (or chunks of lines) incrementally to your worker processes too. There wasn't enough detail for me to suggest concrete code for that, but keep the goal in mind and you'll figure it out: you only need enough RAM to keep incrementally feeding lines to worker processes, and to save away however much of the worker processes' results you need to keep in RAM. It's possible that peak memory use doesn't need to be more than "tiny", regardless of input file size.
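For illustration only, since the real worker code isn't shown: one hedged sketch of "feeding incrementally" uses Pool.imap over the generator, with process_line as a hypothetical stand-in for whatever each worker actually does. (Note that Pool's internal feeder thread can still run ahead of slow workers, so truly huge inputs may need extra throttling.)

import multiprocessing as mp

def process_line(line):
    # hypothetical stand-in for the real per-line work
    return len(line)

def getlines4(f):
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()

if __name__ == "__main__":
    pool = mp.Pool(20)
    with open("14982X16.fastq") as f:
        # imap pulls lines from the generator as tasks are dispatched,
        # so the full list of lines is never materialized in the parent
        for result in pool.imap(process_line, getlines4(f), chunksize=1000):
            pass  # aggregate or write out results incrementally here
    pool.close()
    pool.join()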
Fighting memory management details instead is enormously harder than taking a memory-friendly approach to begin with. Python itself has several memory-management subsystems, and a great deal can be said about each of them. They in turn rely on the platform C malloc/free facilities, about which there's also a great deal to learn. And we're still not at a level that has anything directly to do with what your operating system reports for "memory use". The platform C libraries in turn rely on platform-specific OS memory management primitives, which - typically - only OS kernel memory experts truly understand.
The answer to "why does the OS say I'm still using N GiB of RAM?" can rely on application-specific details in any one of those layers, or even on unfortunate more-or-less accidental interactions among them. Far better to arrange not to need to ask such questions to begin with.
It's great that you gave some runnable code, but not so great that nobody but you can run it since nobody else has your data ;-) Things like "how many lines are there?" and "what's the distribution of line lengths?" can be critical, but we have no way to guess.
As I noted before, application-specific details are often necessary to out-think modern memory managers. They're complex, and behavior at all the levels can be subtle.
Python's primary object allocator ("obmalloc") requests "arenas" from the platform C malloc, chunks of 2**18 bytes. So long as that's the Python memory system your application is using (which can't be guessed at because we don't have your data to work with), 256 KiB is the smallest granularity at which memory is requested from, or returned to, the C level. The C level in turn typically has "chunk things up" strategies of its own, which vary across C implementations.
A Python arena is in turn carved into 4 KiB "pools", each of which dynamically adapts to be carved into smaller chunks of a fixed size per pool (8-byte chunks, 16-byte chunks, 24-byte chunks, ..., 8*i-byte chunks per pool).
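If you're curious, CPython can dump its own arena/pool statistics: sys._debugmallocstats() (present in CPython 2.7.4+ and 3.3+, and deliberately not a stable public API) prints them to stderr:

import sys

# CPython-only debugging aid: prints obmalloc arena, pool, and
# block-size-class statistics to stderr
sys._debugmallocstats()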
So long as a single byte in an arena is being used for live data, the entire arena must be retained. If that means the other 262,143 arena bytes sit unused, tough luck. As your output shows, all the memory is returned in the end, so why do you really care? I understand it's an abstractly interesting puzzle, but you're not going to solve it short of making major efforts to understand the code in CPython's obmalloc.c, and that's just for a start. Any "summary" would leave out a detail that's actually important to some application's microscopic behavior.
Plausible: your strings are short enough that space for all the string object headers and contents (the actual string data) are obtained from CPython's obmalloc. They're going to be splattered all over multiple arenas. An arena might look like this, where "H" represents pools from which string object headers are allocated, and "D" pools from which space for string data is allocated:
HHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDDHHDD...
In your method1 they'll tend to alternate like that because creating a single string object requires allocating space separately for the string object header and the string object data. When you go on to throw out 3/4 of the strings you created, more or less 3/4 of that space becomes reusable to Python. But not one byte can be returned to the system C because there's still live data sprayed all over the arena, containing the quarter of the string objects you didn't throw away (here "-" means space available for reuse):

HHDD------------HHDD------------HHDD------------HHDD----...

There's so much free space that, in fact, it's possible that the less wasteful method2 can get all the memory it needs from the "--------" holes left over from method1, even when you don't throw away the method1 result.
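You can watch that pinning effect directly with a toy version of method1's allocation pattern. A sketch using psutil (exact numbers depend on platform and Python version); dropping 3/4 of the strings typically releases far less than 3/4 of the process footprint:

import os
import psutil

proc = psutil.Process(os.getpid())

def usage_gb():
    return proc.memory_info().vms / 1000000000.0

print("baseline:           %.3f GB" % usage_gb())
data = [("line %d " % i) * 4 for i in range(3000000)]
print("after allocating:   %.3f GB" % usage_gb())

# keep every 4th string; the survivors are scattered across arenas,
# so most arenas stay pinned and cannot be returned to the C level
data = [s for i, s in enumerate(data) if i % 4 == 1]
print("after dropping 3/4: %.3f GB" % usage_gb())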
Just to keep things simple ;-) , I'll note that some of those details about how CPython's obmalloc gets used vary across Python releases too. In general, the more recent the Python release, the more it tries to use obmalloc first instead of the platform C malloc/free (because obmalloc is generally faster).
But even if you use the platform C malloc/free directly, you can still see the same kinds of things happening. Kernel memory system calls are typically more expensive than running code purely in user space, so platform C malloc/free routines typically have their own strategies for "ask the kernel for much more memory than we need for a single request, and carve it up into smaller pieces ourself".
Something to note: neither Python's obmalloc nor platform C malloc/free implementations ever move live data on their own. Both return memory addresses to clients, and those cannot change. "Holes" are an inescapable fact of life under both.