I'm working on a project that uses Python to read, process and write files that are sometimes as large as a few hundred megabytes. The program occasionally fails when I try to process particularly large files. It does not say 'memory error' (in fact it gives no reason at all for failing), but I suspect that is the problem.
I've been testing the code on smaller files and watching top to see what memory usage is like, and it typically reaches 60%. top says that I have 4050352k total memory, so about 3.8 GB.
Meanwhile I'm trying to track memory usage within Python itself (see my question from yesterday) with the following little bit of code:
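    import sys

    mem = 0
    for variable in dir():
        variable_ = vars()[variable]
        # Count an object as a numpy array if its type comes from numpy
        # and it reports its own buffer size
        numpy_ = (type(variable_).__module__.startswith('numpy')
                  and hasattr(variable_, 'nbytes'))
        if numpy_:
            mem_ = variable_.nbytes
        else:
            # Measure the object itself, not the string holding its name
            mem_ = sys.getsizeof(variable_)
        mem += mem_
        print(variable + ' type: ' + str(type(variable_)) + ' size: ' + str(mem_))
    print('Total: ' + str(mem))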
Before I run that block I set all the variables I don't need to None, close all files and figures, and so on. After that block I use subprocess.call() to run a Fortran program that is required for the next stage of processing. Looking at top while the Fortran program is running shows that it is using ~100% of the CPU and ~5% of the memory, and that Python is using 0% of the CPU and 53% of the memory. However, my little snippet of code tells me that all of the variables in Python add up to only 23 MB, which ought to be ~0.5%.
So what's happening? I wouldn't expect that little snippet to give me a spot-on memory usage, but surely it ought to be accurate to within a few MB? Or is it just that top doesn't notice the memory has been relinquished, but that it is available to other programs that need it if necessary?
As requested, here's a simplified version of the part of the code that uses up all the memory. file_name.cub is an ISIS3 cube: a file that contains 5 layers (bands) of the same map. The first layer is spectral radiance, and the next 4 have to do with latitude, longitude, and other details. It's an image from Mars that I'm trying to process. StartByte is a value I previously read from the .cub file's ASCII header that gives the beginning byte of the data; Samples and Lines are the dimensions of the map, also read from the header.
    import struct
    import numpy as np
    import matplotlib.pyplot as plt

    radiance_array = 'cheese' # It'll make sense in a moment
    f_to = open('To_file.dat','w')
    f_rad = open('file_name.cub', 'rb')
    f_rad.seek(0)
    header=struct.unpack('%dc' % (StartByte-1), f_rad.read(StartByte-1))
    header = None
    #
    f_lat = open('file_name.cub', 'rb')
    f_lat.seek(0)
    header=struct.unpack('%dc' % (StartByte-1), f_lat.read(StartByte-1))
    header = None
    pre=struct.unpack('%df' % (Samples*Lines), f_lat.read(Samples*Lines*4))
    pre = None
    #
    f_lon = open('file_name.cub', 'rb')
    f_lon.seek(0)
    header=struct.unpack('%dc' % (StartByte-1), f_lon.read(StartByte-1))
    header = None
    pre=struct.unpack('%df' % (Samples*Lines*2), f_lon.read(Samples*Lines*2*4))
    pre = None
    # (And something similar for the other two bands)
    # So header and pre are just to get to the right part of the file, and are
    # then set to None. I did try using seek(), but it didn't work for some
    # reason, and I ended up with this technique.
    for line in range(Lines):
        sample_rad = struct.unpack('%df' % (Samples), f_rad.read(Samples*4))
        sample_rad = np.array(sample_rad)
        sample_rad[sample_rad<-3.40282265e+38] = np.nan
        # And similar lines for all bands
        # Then some arithmetic operations on some of the arrays
        i = 0
        for value in sample_rad:
            # Values are floats, so convert them before joining
            nextline = str(sample_lat[i])+', '+str(sample_lon[i])+', '+str(value) # And other stuff
            f_to.write(nextline)
            i += 1
        # Still the placeholder string? I'd love to know a better way to do this!
        if isinstance(radiance_array, str):
            radiance_array = sample_rad.reshape(len(sample_rad),1)
        else:
            radiance_array = np.append(radiance_array, sample_rad.reshape(len(sample_rad),1), axis=1)
        # And again, similar operations on all arrays. I end up with 5 output arrays
        # with dimensions ~830x4000. For the large files they can reach ~830x20000
    f_rad.close()
    f_lat.close()
    f_to.close() # etc etc
    sample_lat = None # etc etc
    sample_rad = None # etc etc
    #
    plt.figure()
    plt.imshow(radiance_array)
    # I plot all the arrays, for diagnostic reasons
    plt.show()
    plt.close()
    radiance_array = None # etc etc
    # I set all arrays apart from one (which I need to identify the
    # locations of nan in future) to None
    # LOCATION OF MEMORY USAGE MONITOR SNIPPET FROM ABOVE
So I lied in the comments about opening several files: it's many instances of the same file. I only continue with one array that isn't set to None, and its size is ~830x4000, yet this somehow accounts for 50% of my available memory. I've also tried gc.collect(), but it made no difference. I'd be very happy to hear any advice on how I could improve any of that code (related to this problem or otherwise).
Perhaps I should mention: originally I was reading the files in full (i.e. not line by line as above); doing it line by line was an initial attempt to save memory.
Just because you've dereferenced your variables doesn't mean the Python process has given the allocated memory back to the system. See "How can I explicitly free memory in Python?".
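Since freed memory may stay with the process, it can also help to keep the peak low in the first place. The following is only a sketch of one way to do that for the reading loop in the question, under assumptions the question does not confirm: that the band data are native-endian 32-bit floats stored one band after another, starting StartByte-1 bytes into the file, as the struct.unpack calls suggest. The function read_band and its parameters are hypothetical names. It seeks directly to the band and fills a preallocated array, rather than building tuples of Python floats and calling np.append on every line. The NaN masking step is left out.

```python
import os
import tempfile

import numpy as np


def read_band(path, start_byte, samples, lines, band=0):
    """Read one float32 band into a preallocated (samples, lines) array.

    Assumed layout (not confirmed by the question): a header of
    start_byte - 1 bytes, then the bands back to back, each holding
    `lines` rows of `samples` native-endian 32-bit floats.
    """
    bytes_per_line = samples * 4
    offset = (start_byte - 1) + band * lines * bytes_per_line
    # Allocate the output once instead of growing it with np.append.
    out = np.empty((samples, lines), dtype=np.float32)
    with open(path, 'rb') as f:
        f.seek(offset)  # skip the header and earlier bands without reading them
        for line in range(lines):
            row = np.frombuffer(f.read(bytes_per_line), dtype=np.float32)
            out[:, line] = row  # copy this line into its column
    return out


if __name__ == '__main__':
    # Tiny synthetic file: a 10-byte header followed by two 3x4 bands.
    data = np.arange(24, dtype=np.float32)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b'H' * 10 + data.tobytes())
    try:
        print(read_band(tmp.name, 11, samples=3, lines=4, band=1))
    finally:
        os.remove(tmp.name)
```

With this layout each line is held in memory only once, as 4 bytes per value, and the full array is never copied while it grows.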
If gc.collect() does not work for you, investigate forking and reading/writing your files in child processes using IPC. Those processes will end when they're finished and release the memory back to the system. Your main process will continue to run with low memory usage.
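A minimal sketch of the child-process idea, assuming a Unix-like system where the 'fork' start method is available. The names heavy_step and run_in_child are hypothetical, and heavy_step is only a stand-in for the real file processing. The child does the memory-hungry work and sends back a small result over a pipe; whatever it allocated is returned to the system when it exits.

```python
import multiprocessing as mp
import os

import numpy as np


def heavy_step(n_values):
    """Hypothetical stand-in for the memory-hungry processing stage."""
    big = np.arange(n_values, dtype=np.float64)  # large temporary array
    # Return only a small summary; the large array dies with the child.
    return float(big.sum())


def _child_main(conn, func, args):
    """Runs in the child: compute the result and send it to the parent."""
    try:
        conn.send(func(*args))
    finally:
        conn.close()


def run_in_child(func, *args):
    """Run func(*args) in a forked child process and return its result."""
    ctx = mp.get_context('fork')  # assumption: Unix-like OS
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child_main, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    try:
        # Receive before joining so a large result cannot block the child.
        # Raises EOFError if the child exits without sending anything.
        result = parent_conn.recv()
    finally:
        parent_conn.close()
        proc.join()
    return result


if __name__ == '__main__':
    total = run_in_child(heavy_step, 10_000_000)
    print('parent pid:', os.getpid(), 'result from child:', total)
```

Only the return value crosses the pipe, so it should be kept small (or written to a file by the child) if the goal is to keep the parent's memory usage low.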