
Memory efficient Python batch processing

question

I wrote a small Python batch processor that loads binary data, performs NumPy operations, and stores the results. It consumes much more memory than it should. I have looked at similar Stack Overflow discussions and would like to ask for further recommendations.

background

I convert spectral data to RGB. The spectral data is stored in a Band Interleaved by Line (BIL) image file, which is why I read and process the data line by line. I read the data using the Spectral Python library, which returns NumPy arrays. hyp is a descriptor of a large spectral file: hyp.ncols = 1600, hyp.nrows = 3430, hyp.nbands = 160.

code

import spectral
import numpy as np
import scipy.interpolate

class CIE_converter(object):
    def __init__(self, cie):
        self.cie = cie

    def interpolateBand_to_cie_range(self, hyp, hyp_line):
        # Resample one image line from the sensor's band centers to the CIE wavelengths
        interp = scipy.interpolate.interp1d(hyp.bands.centers, hyp_line,
                                            kind='cubic', bounds_error=False, fill_value=0)
        return interp(self.cie[:, 0])

    #@profile
    def spectrum2xyz(self, hyp):
        out = np.zeros((hyp.ncols, hyp.nrows, 3))
        spec_line = hyp.read_subregion((0, 1), (0, hyp.ncols)).squeeze()
        spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
        for ii in xrange(hyp.nrows):
            spec_line = hyp.read_subregion((ii, ii + 1), (0, hyp.ncols)).squeeze()
            spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
            out[:, ii, :] = np.dot(spec_line_int, self.cie[:, 1:4])
        return out
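
For context, this is roughly how the class is driven; a minimal sketch, where the file names are placeholders and cie is assumed to be an (N, 4) array with wavelengths in column 0 and the x, y, z colour-matching functions in columns 1 to 3 (as the indexing above implies):

import spectral
import numpy as np

hyp = spectral.open_image('scene.hdr')           # placeholder ENVI header file
cie = np.loadtxt('cie_1931.csv', delimiter=',')  # placeholder (N, 4) CIE table
converter = CIE_converter(cie)
xyz = converter.spectrum2xyz(hyp)                # result has shape (ncols, nrows, 3)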

memory consumption

All the big data is initialised outside the loop. My naive interpretation was that the memory consumption should not increase (have I used too much Matlab?). Can someone explain the factor-of-10 increase to me? It is not linear, as hyp.nrows = 3430. Are there any recommendations to improve the memory management?

  Line #    Mem usage    Increment   Line Contents
  ================================================
  76                                 @profile
  77     60.53 MB      0.00 MB       def spectrum2xyz(self, hyp):
  78    186.14 MB    125.61 MB           out = np.zeros((hyp.ncols,hyp.nrows,3))
  79    186.64 MB      0.50 MB           spec_line = hyp.read_subregion((0,1), (0,hyp.ncols)).squeeze()
  80    199.50 MB     12.86 MB           spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  81                             
  82   2253.93 MB   2054.43 MB           for ii in xrange(hyp.nrows):
  83   2254.41 MB      0.49 MB               spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()
  84   2255.64 MB      1.22 MB               spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  85   2235.08 MB    -20.55 MB               out[:,ii,:] = np.dot(spec_line_int,self.cie[:,1:4])
  86   2235.08 MB      0.00 MB           return out
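
For reference, the table above is memory_profiler's line-by-line report. A minimal sketch of producing such output, using a stand-in function rather than the original method:

from memory_profiler import profile
import numpy as np

@profile
def allocate_demo(nrows=3430, ncols=1600):
    # ncols * nrows * 3 float64 values ~ 125 MB, matching the jump at line 78 above
    out = np.zeros((ncols, nrows, 3))
    return out

if __name__ == '__main__':
    allocate_demo()   # running the script prints the per-line memory report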

notes

I replaced range with xrange without a drastic improvement. I'm aware that cubic interpolation is not the fastest, but this question is not about CPU consumption.

1 Answer

Thanks for the comments. They all helped me improve the memory consumption a little, but eventually I figured out the main reason for the memory consumption:

Spectral Python images contain a NumPy memmap object. It has the same layout as the hyperspectral data cube (in the case of the BIL format, (nrows, nbands, ncols)). When calling

spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()

the line is not only returned as a NumPy array, but also cached in hyp.memmap. A second call would be faster, but in my case the memory just keeps increasing until the OS complains. Since the memmap is actually a great implementation, I will take direct advantage of it in future work, as sketched below.
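
One way this could look is to map the raw BIL file with a plain NumPy memmap and slice it line by line, so nothing beyond the current line is copied; the file name, dtype and shape here are assumptions and must match the actual ENVI header:

import numpy as np

nrows, nbands, ncols = 3430, 160, 1600
cube = np.memmap('scene.bil', dtype=np.uint16, mode='r',
                 shape=(nrows, nbands, ncols))   # BIL layout: (rows, bands, cols)

for ii in range(nrows):
    spec_line = np.asarray(cube[ii]).T           # one line as an (ncols, nbands) array
    # interpolate to the CIE wavelengths and project to XYZ as in the question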
