
Why does my python process use up so much memory?

I'm working on a project that involves using Python to read, process, and write files that are sometimes as large as a few hundred megabytes. The program occasionally fails when I try to process some particularly large files. It does not say 'memory error', but I suspect that is the problem (in fact, it gives no reason at all for failing).

I've been testing the code on smaller files and watching 'top' to see what memory usage is like, and it typically reaches 60%. top says that I have 4050352k total memory, so about 3.8 GB.

Meanwhile I'm trying to track memory usage within python itself (see my question from yesterday) with the following little bit of code:

import sys

mem = 0
for variable in dir():
    variable_ = vars()[variable]
    try:
        if str(type(variable_))[7:12] == 'numpy':
            numpy_ = True
        else:
            numpy_ = False
    except:
        numpy_ = False
    if numpy_:
        mem_ = variable_.nbytes          # numpy arrays know the size of their data buffer
    else:
        mem_ = sys.getsizeof(variable_)  # size of the object itself, not anything it references
    mem += mem_
    print variable+' type: '+str(type(variable_))+' size: '+str(mem_)
print 'Total: '+str(mem)

Before I run that block I set all the variables I don't need to None, close all files and figures, etc. After that block I use subprocess.call() to run a Fortran program that is required for the next stage of processing. Looking at top while the Fortran program is running shows that it is using ~100% of the CPU and ~5% of the memory, while Python is using 0% of the CPU and 53% of the memory. However, my little snippet of code tells me that all of the variables in Python add up to only 23 MB, which ought to be ~0.5%.

So what's happening? I wouldn't expect that little snippet to give me a spot-on memory usage figure, but it ought to be accurate to within a few MB, surely? Or is it just that top doesn't notice the memory has been relinquished, but that it is available to other programs that need it if necessary?
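(For comparison, here is a minimal Linux-only sketch of how to ask the OS what the process is actually holding, which is the number top reports; sys.getsizeof only counts the Python objects themselves, not memory the interpreter's allocator keeps around after objects are freed. The /proc parsing assumes a standard Linux /proc layout.)

import resource

def current_rss_kb():
    # Current resident set size, read from the same place top gets it.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])   # reported in kB
    return -1

# Peak resident set size of this process; reported in kB on Linux.
peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print 'current RSS: '+str(current_rss_kb())+' kB, peak RSS: '+str(peak_rss_kb)+' kB'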

As requested, here's a simplified part of the code that is using up all the memory. (file_name.cub is an ISIS3 cube: a file containing 5 layers (bands) of the same map; the first layer is spectral radiance, and the next 4 have to do with latitude, longitude, and other details. It's an image from Mars that I'm trying to process. StartByte is a value I previously read from the .cub file's ASCII header telling me the beginning byte of the data; Samples and Lines are the dimensions of the map, also read from the header.)

import struct
import numpy as np
import matplotlib.pyplot as plt

radiance_array = 'cheese'   # It'll make sense in a moment (similar placeholders for the other arrays)
f_to = open('To_file.dat','w') 

f_rad = open('file_name.cub', 'rb')
f_rad.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_rad.read(StartByte-1))
header = None    
#
f_lat = open('file_name.cub', 'rb')
f_lat.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_lat.read(StartByte-1))
header = None 
pre=struct.unpack('%df' % (Samples*Lines), f_lat.read(Samples*Lines*4))
pre = None
#
f_lon = open('file_name.cub', 'rb')
f_lon.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_lon.read(StartByte-1))
header = None 
pre=struct.unpack('%df' % (Samples*Lines*2), f_lon.read(Samples*Lines*2*4))
pre = None
# (And something similar for the other two bands)
# So header and pre are just to get to the right part of the file, and are 
# then set to None. I did try using seek(), but it didn't work for some
# reason, and I ended up with this technique.
for line in range(Lines):
    sample_rad = struct.unpack('%df' % (Samples), f_rad.read(Samples*4))
    sample_rad = np.array(sample_rad)
    sample_rad[sample_rad<-3.40282265e+38] = np.nan  
    # And Similar lines for all bands
    # Then some arithmetic operations on some of the arrays
    i = 0
    for value in sample_rad:
        nextline = str(sample_lat[i])+', '+str(sample_lon[i])+', '+str(value)+'\n' # And other stuff
        f_to.write(nextline)
        i += 1
    if radiance_array == 'cheese':  # I'd love to know a better way to do this!
        radiance_array = sample_rad.reshape(len(sample_rad),1)
    else:
        radiance_array = np.append(radiance_array, sample_rad.reshape(len(sample_rad),1), axis=1)
        # And again, similar operations on all arrays. I end up with 5 output arrays
        # with dimensions ~830*4000. For the large files they can reach ~830x20000
f_rad.close()
f_lat.close()
f_to.close()   # etc etc 
sample_lat = None  # etc etc
sample_rad = None  # etc etc

#
plt.figure()
plt.imshow(radiance_array)
# I plot all the arrays, for diagnostic reasons

plt.show()
plt.close()

radiance_array = None  # etc etc
# I set all arrays apart from one (which I need to identify the 
# locations of nan in future) to None

# LOCATION OF MEMORY USAGE MONITOR SNIPPET FROM ABOVE

So I lied in the comments about opening several files; it's many instances of the same file. I only continue with one array that isn't set to None, and its size is ~830x4000, though this somehow constitutes 50% of my available memory. I've also tried gc.collect(), but no change. I'd be very happy to hear any advice on how I could improve on any of that code (related to this problem or otherwise).

Perhaps I should mention: originally I was opening the files in full (i.e. not line by line as above); doing it line by line was an initial attempt to save memory.
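For reference, here is a rough sketch (untested, and assuming the same StartByte, Samples, and Lines values read from the header) of how the radiance band could be read with seek() and np.fromfile, collecting rows in a list and stacking once at the end, instead of unpacking throwaway header/pre tuples and calling np.append on every line (np.append copies the whole array each time):

import numpy as np

f_rad = open('file_name.cub', 'rb')
f_rad.seek(StartByte - 1)                  # jump straight past the header to the radiance band

rows = []
for line in range(Lines):
    sample_rad = np.fromfile(f_rad, dtype=np.float32, count=Samples)
    sample_rad[sample_rad < -3.40282265e+38] = np.nan
    rows.append(sample_rad)
f_rad.close()

radiance_array = np.vstack(rows).T         # Samples x Lines, built with a single allocation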

asked Aug 03 '12 by EddyTheB

1 Answer

Just because you've dereferenced your variables doesn't mean the Python process has given the allocated memory back to the system. See How can I explicitly free memory in Python?.

If gc.collect() does not work for you, investigate forking and reading/writing your files in child processes using IPC. Those processes will end when they're finished and release the memory back to the system. Your main process will continue to run with low memory usage.
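As a rough illustration (not your exact code; process_file here is a hypothetical stand-in for the read/convert/plot work in the question), the multiprocessing module gives you this pattern with very little ceremony:

from multiprocessing import Process

def process_file(in_name, out_name):
    # Hypothetical placeholder: open the .cub file, build the arrays,
    # write To_file.dat, make the diagnostic plots, etc.
    pass

if __name__ == '__main__':
    p = Process(target=process_file, args=('file_name.cub', 'To_file.dat'))
    p.start()
    p.join()    # memory the child allocated goes back to the OS when it exits
    # The parent stays small, so subprocess.call() for the Fortran step
    # runs alongside a lean Python process.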

answered Oct 01 '22 by bioneuralnet