
Preallocating a large array in a MATLAB matfile with something other than zeroes

I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile command, which allows random access to a .mat file on disk.

I am trying to preallocate the array in this file, and the approach recommended by a MathWorks blog is

matObj = matfile('myBigData.mat','Writable',true); 
matObj.X(10000,10000) = 0;

This works, but leaves me with a large array of zeroes - which is risky, as some of the genuine values that I will be populating it with may also be zero. For smaller arrays, I would typically do

smallarray = nan(20,20);

But if I try this approach for the large array I get an "out of memory" error; presumably the nan() function is producing the large array of NaNs in memory first.

How can I preallocate a large array with something other than zeroes?

Flyto asked Sep 30 '22 18:09



1 Answer

I found that neither sclarke81's nor Sam Roberts's answer actually works, and I doubt that the concept of preallocation applies to matfile. The results reported below were obtained on an i7-3770 CPU @ 3.4 GHz with 16.8 GB of main memory, running MATLAB R2013a on Linux 3.16.

The code

mf = matfile(fn, 'Writable', true);
mf.x(5000, 200000) = 0;
clear mf

theoretically "allocates" 8 GB of disk space (5000 × 200000 doubles at 8 bytes each), initialized to 0. However, the resulting file is only 4726 bytes in size, and the process takes less than 0.01 seconds. Increasing the size 10- or 100-fold changes little. Strange. Incidentally, the clear at the end is there to ensure that MATLAB writes and closes the file.
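The gap between the nominal and the actual size can be checked directly. The following is a minimal sketch, assuming a scratch file created with tempname; the variable names fn, dims, nominalBytes and info are illustrative and not part of the original test. It reads the dimensions recorded in the file with size on the matfile object and the on-disk size with dir.

```matlab
% Sketch: compare the nominal size of the array with what is actually on disk.
fn = [tempname '.mat'];             % hypothetical scratch file
nRows = 5000;
nCols = 200000;

mf = matfile(fn, 'Writable', true);
mf.x(nRows, nCols) = 0;             % "allocate" by assigning the last element
dims = size(mf, 'x');               % dimensions as recorded in the file
clear mf                            % release the matfile object

nominalBytes = prod(dims) * 8;      % a double occupies 8 bytes
info = dir(fn);                     % info.bytes is the real file size
fprintf('nominal: %.1f GB, on disk: %d bytes\n', nominalBytes / 1e9, info.bytes);

delete(fn);                         % remove the scratch file
```

The exact number of bytes on disk may differ between MATLAB versions and platforms, so the sketch prints it rather than assuming a value.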

Often we want to preallocate with NaN instead of 0. Doing this the conventional way

mf = matfile(fn, 'Writable', true);
mf.x = nan(5000, 200000);
clear mf

takes 11 seconds and results in a file of 57 MB. But as the OP pointed out, this approach doesn't make sense because it first generates the whole matrix of 8 GB in memory and then writes it out, which defeats the purpose of matfile. If the matrix fits into memory, there's no reason in the first place to keep the data in a file while processing it.

Sam Roberts proposed to first allocate/initialize to 0 as above, and then change the values to NaN:

mf = matfile(fn, 'Writable', true);
mf.x(5000, 200000) = 0;
mf.x = mf.x * nan;
clear mf

This takes 16 seconds, with the same resulting file size. However, this is in no way better than the naive approach above, because on the third line the whole matrix is read into memory, multiplied by the scalar NaN in memory, and then written out again, leading to a peak memory consumption of 8 GB. (This is not only consistent with the semantics of matfile variables explained in the documentation; I also checked with a memory usage monitor.)

sclarke81 proposed to instead avoid generation of the matrix in memory this way:

mf = matfile(fn, 'Writable', true);
mf.x(1 : 5000, 1 : 200000) = nan;
clear mf

the idea probably being that only a scalar NaN is generated in memory and then copied into every element of the on-disk matrix. However, that is not what happens. In fact, this method appears to consume about 8.38 GB of memory at peak, 12% more than the naive approach!

Now more on the merits of preallocation with matfile. If one does not preallocate, but fills the array row-wise with NaNs

mf = matfile(fn, 'Writable', true);
for i = 1 : 5000
    mf.x(i, 1 : 200000) = nan(1, 200000);
end
clear mf

this takes 27 seconds. But if one preallocates by initializing to 0 and then overwrites row-wise with NaNs

mf = matfile(fn, 'Writable', true);
mf.x(5000, 200000) = 0;
for i = 1 : 5000
    mf.x(i, 1 : 200000) = nan(1, 200000);
end
clear mf

it takes ages: the process was only about 3% finished when I aborted it after 45 minutes, extrapolating to about a day of total runtime!
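To repeat this comparison without committing to hours of runtime, the two row-wise variants can be timed at a much smaller scale. This is only a sketch: the sizes nRows and nCols are deliberately small, hypothetical values, the file names come from tempname, and the timings it prints will not match the figures reported above, which apply to the 8 GB case.

```matlab
% Sketch: time row-wise filling with and without zero preallocation.
nRows = 200;                        % hypothetical demo sizes, far smaller
nCols = 5000;                       % than the 5000 x 200000 case above

% Variant A: no preallocation, the variable grows row by row.
fnA = [tempname '.mat'];
mfA = matfile(fnA, 'Writable', true);
tic
for i = 1 : nRows
    mfA.x(i, 1 : nCols) = nan(1, nCols);
end
tA = toc;
clear mfA

% Variant B: preallocate with zeros, then overwrite row by row.
fnB = [tempname '.mat'];
mfB = matfile(fnB, 'Writable', true);
tic
mfB.x(nRows, nCols) = 0;
for i = 1 : nRows
    mfB.x(i, 1 : nCols) = nan(1, nCols);
end
tB = toc;
clear mfB

fprintf('no preallocation: %.2f s, with preallocation: %.2f s\n', tA, tB);
delete(fnA);
delete(fnB);
```

Increasing nRows and nCols step by step should show whether and where the two variants start to diverge on a given machine and MATLAB version.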

The behavior of matlab.io.MatFile is dark and mysterious, and it seems that at the moment, only testing extensively will lead to an effective way to use this facility. However, one may conclude that preallocation is a bad idea when it comes to matfile.
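If the goal is simply an on-disk array filled with NaN that never has to exist in memory as a whole, one option in line with the findings above is to skip preallocation and write NaN blocks directly. The sketch below writes blocks of columns rather than single rows; this choice rests on the assumption that column blocks suit MATLAB's column-major storage, which was not benchmarked here. The block width blockCols and the small demo sizes are hypothetical and would need tuning for real data.

```matlab
% Sketch: fill an on-disk array with NaN block by block, without preallocation.
fn = [tempname '.mat'];             % hypothetical scratch file
nRows = 500;                        % small demo sizes; scale up as needed
nCols = 2000;
blockCols = 256;                    % columns per block (assumed, untuned)

mf = matfile(fn, 'Writable', true);
for c0 = 1 : blockCols : nCols
    c1 = min(c0 + blockCols - 1, nCols);      % last column of this block
    % Only one block of NaNs exists in memory at any time.
    mf.x(1 : nRows, c0 : c1) = nan(nRows, c1 - c0 + 1);
end

dims = size(mf, 'x');               % final dimensions recorded in the file
lastCol = mf.x(1 : nRows, nCols);   % read back a single column as a spot check
clear mf
delete(fn);                         % remove the scratch file
```

Peak memory in this scheme is bounded by one block, nRows * blockCols * 8 bytes, so blockCols can be chosen to fit the available memory.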

A. Donda answered Oct 10 '22 12:10
