 

Excessively large overhead in MATLAB .mat file

Tags:

matlab

I am parsing a large text file full of data and then saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for more information on reading in the files, and here for the data). To do so, I read in one line at a time, parse the line, and then append it to the file. The problem is that the file itself is >3 orders of magnitude larger than the data contained therein!
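
For reference, the whole point of the *.mat file is that I can later pull back just a slice of one variable without loading everything, along these lines (a minimal sketch; the index range is just illustrative):

% Open the .mat file lazily and read only part of one variable.
m = matfile('01_hit12.mat');
% Pull in entries 1000 through 2000 of the wavenumber vector only;
% the rest of the file stays on disk.
wn = m.vacuumWavenumber(1, 1000:2000);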

Here is a stripped down version of my code:

database = which('01_hit12.par');
[directory,filename,~] = fileparts(database);
matObj = matfile(fullfile(directory,[filename '.mat']),'Writable',true);

fidr = fopen(database);
hitranTemp = fgetl(fidr);
k = 1;
while ischar(hitranTemp)
    % A leading space (ASCII 32) means a single-digit molecule number;
    % replace it with '0' so the fixed-width %2u field parses correctly.
    if abs(hitranTemp(1)) == 32
        hitranTemp(1) = '0';
    end

    hitran = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');

    matObj.moleculeNumber(1,k)      = uint8(hitran{1});
    matObj.isotopeologueNumber(1,k) = uint8(hitran{2});
    matObj.vacuumWavenumber(1,k)    = hitran{3};
    matObj.lineIntensity(1,k)       = hitran{4};
    matObj.airWidth(1,k)            = single(hitran{6});
    matObj.selfWidth(1,k)           = single(hitran{7});
    matObj.lowStateE(1,k)           = single(hitran{8});
    matObj.tempDependWidth(1,k)     = single(hitran{9});
    matObj.pressureShift(1,k)       = single(hitran{10});

    % Progress report every 10,000 lines; K is the total line count,
    % computed earlier (not shown in this stripped-down version).
    if rem(k,1e4) == 0
        fprintf('line %u (%2.2f)\n',k,100*k/K);
    end
    hitranTemp = fgetl(fidr);
    k = k + 1;
end
fclose(fidr);

I stopped the code after 13,813 of the 224,515 lines had been parsed because it had been taking a very long time and the file size was getting huge, but the last printout indicated that I had only just cleared 10k lines. I cleared the memory, and then ran:

S = whos('-file','01_hit12.mat');
fileBytes = sum([S.bytes]);

T = dir(which('01_hit12.mat'));
diskBytes = T.bytes;

disp([fileBytes diskBytes diskBytes/fileBytes])

and get this output (bytes according to whos, bytes on disk, and their ratio):

524894 896189009 1707.37141022759

What is taking up the extra 895,664,115 bytes? I know the help page says there should be a little extra overhead, but I feel that nearly a GB of descriptive header is a bit excessive!

New information:
I tried pre-allocating the file, thinking that perhaps MATLAB was doing the same thing it does when a matrix is embiggened in a loop, reallocating a chunk of disk space for the entire matrix on each write, but that isn't it. Filling the file with zeros of the appropriate data types (a sketch of that pre-allocation is below) results in a file for which my short check script returns:

8531570 71467 0.00837677004349727

This makes more sense to me. MATLAB is saving the file sparsely, so the on-disk size is much smaller than the size of the full matrix in memory. Once it starts replacing the zeros with real data, however, I get the same behavior as before and the file size skyrockets beyond all reasonable bounds.
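
For concreteness, the pre-allocation I tried was essentially this (a sketch; nLines stands for the total line count, 224,515 for this file):

% Fill every variable in the matfile with zeros of its final size and
% class before any real data is written (remaining variables omitted).
nLines = 224515;
matObj.moleculeNumber   = zeros(1,nLines,'uint8');
matObj.vacuumWavenumber = zeros(1,nLines,'double');
matObj.lineIntensity    = zeros(1,nLines,'double');
matObj.airWidth         = zeros(1,nLines,'single');
% ...and so on for the other fields.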

New new information:
I tried this on a subset of the data, 100 lines long. To stream to disk, the data has to be in v7.3 format, so I ran the subset through my script, loaded it into memory, and then re-saved it in v7.0 format. Here are the results (same three numbers as above):

v7.3: 3800 8752 2.30
v7.0: 3800 2561 0.67
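
The v7.0 number came from a plain load and re-save, roughly like this (the file names here are just illustrative):

% Load the streamed v7.3 subset back into memory, then write the same
% variables out again in v7.0 format for comparison.
s = load('subset_v73.mat');
save('subset_v70.mat','-struct','s','-v7');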

No wonder the v7.3 format isn't the default. Does anyone know a way around this? Is this a bug or a feature?

asked Oct 24 '13 by craigim


1 Answer

This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.

Start off by pre-allocating:

% Count the lines up front by reading the whole file as raw bytes and
% counting newline characters (ASCII 10).
fid = fopen('01_hit12.par', 'r');
data = fread(fid, inf, 'uint8');
nlines = nnz(data == 10) + 1;
fclose(fid);

% Pre-allocate every variable in the matfile at its final size and class.
matObj.moleculeNumber = zeros(1,nlines,'uint8');
matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
matObj.vacuumWavenumber = zeros(1,nlines,'double');
matObj.lineIntensity = zeros(1,nlines,'double');
matObj.airWidth = zeros(1,nlines,'single');
matObj.selfWidth = zeros(1,nlines,'single');
matObj.lowStateE = zeros(1,nlines,'single');
matObj.tempDependWidth = zeros(1,nlines,'single');
matObj.pressureShift = zeros(1,nlines,'single');

Then to write in chunks of 10000, I modified your code as follows:

... % your code plus pre-alloc first
bs = 10000;
while ischar(hitranTemp)
    if abs(hitranTemp(1)) == 32
        hitranTemp(1) = '0';
    end

    for ii = 1:bs
        hitran{ii} = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
        hitranTemp = fgetl(fidr);
        if hitranTemp==-1, bs=ii; break; end  % fgetl returns -1 at end of file
    end

    % this part really ugly, sorry! trying to keep it compact...
    matObj.moleculeNumber(1,k:k+bs-1)      = uint8(builtin('_paren',cellfun(@(c)c{1},hitran),1:bs));
    matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{2},hitran),1:bs));
    matObj.vacuumWavenumber(1,k:k+bs-1)    = builtin('_paren',cellfun(@(c)c{3},hitran),1:bs);
    matObj.lineIntensity(1,k:k+bs-1)       = builtin('_paren',cellfun(@(c)c{4},hitran),1:bs);
    matObj.airWidth(1,k:k+bs-1)            = single(builtin('_paren',cellfun(@(c)c{5},hitran),1:bs));
    matObj.selfWidth(1,k:k+bs-1)           = single(builtin('_paren',cellfun(@(c)c{6},hitran),1:bs));
    matObj.lowStateE(1,k:k+bs-1)           = single(builtin('_paren',cellfun(@(c)c{7},hitran),1:bs));
    matObj.tempDependWidth(1,k:k+bs-1)     = single(builtin('_paren',cellfun(@(c)c{8},hitran),1:bs));
    matObj.pressureShift(1,k:k+bs-1)       = single(builtin('_paren',cellfun(@(c)c{9},hitran),1:bs));

    k = k + bs;
    fprintf('.');
end
fclose(fidr);
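
The builtin('_paren',...) calls are just an undocumented way of indexing the output of cellfun inline. A more readable equivalent, sketched here under the same assumptions (tmp and idx are throwaway names I've introduced), collects each chunk into a temporary cell array first:

% Same chunk write as above, but via an intermediate bs-by-24 cell
% array so each column can be concatenated into a plain vector.
idx = k:k+bs-1;
tmp = vertcat(hitran{1:bs});              % one row per parsed line
matObj.moleculeNumber(1,idx)      = uint8([tmp{:,1}]);
matObj.isotopeologueNumber(1,idx) = uint8([tmp{:,2}]);
matObj.vacuumWavenumber(1,idx)    = [tmp{:,3}];
matObj.lineIntensity(1,idx)       = [tmp{:,4}];
matObj.airWidth(1,idx)            = single([tmp{:,5}]);
% ...and similarly for the remaining fields.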

The final size on disk is 21,393,408 bytes. The usage breaks down as follows:

>> S = whos('-file','01_hit12.mat');
>> fileBytes = sum([S.bytes]);
>> T = dir(which('01_hit12.mat'));
>> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
>> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
   8531608 whos
  21389582 disk
  2.507099

Still fairly inefficient, but not out of control.

answered by chappjc