
Why does saving mat files with scipy result in larger file size than with Matlab?

Let's say I generate the following toy dataset from Matlab, and I save it as a mat file:

>> arr = rand(100);
>> whos arr
  Name        Size             Bytes  Class     Attributes

  arr       100x100            80000  double
>> save('arr.mat', 'arr')

The saved arr.mat file is of size 75829 Bytes according to the output of the ls command.

If I load the same file using scipy.io.loadmat() and save it again using scipy.io.savemat():

from scipy import io

arr = io.loadmat('arr.mat')
with open('arrscipy.mat', 'wb') as f:  # binary mode: MAT-files are binary
    io.savemat(f, arr)

I obtain a file with a considerably different size (∼ 4KB larger):

$ ls -al
75829 Nov  6 11:52 arr.mat
80184 Nov  6 11:52 arrscipy.mat

I now have two binary mat files containing the same data. My understanding is that the size of a binary mat file is determined by the size of its contained variables, plus some overhead due to file headers. However the sizes of these two files are considerably different. Why is this? Is it a data format problem?

I tried this with arrays of structures too, and the result is similar: scipy-saved mat files are larger than Matlab-saved ones.

JoErNanO asked Nov 06 '15 11:11



2 Answers

Look at the docs:

scipy.io.savemat(file_name, mdict, appendmat=True, format='5',
    long_field_names=False, do_compression=False, oned_as='row')

Compression is turned off by default in scipy (do_compression=False). MATLAB's default save format (-v7) compresses the data, which is why the MATLAB-saved file is smaller.
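To check this yourself, here is a minimal sketch that writes the same 100x100 array to in-memory buffers with and without do_compression and compares the sizes. The random seed, the `mat_size` helper and the use of BytesIO instead of files on disk are my own choices for illustration; they are not from the question.

```python
from io import BytesIO

import numpy as np
from scipy.io import loadmat, savemat

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
arr = rng.random((100, 100))    # 100x100 doubles, 80000 bytes of raw data


def mat_size(compress):
    """Return the size in bytes of a MAT-file written to an in-memory buffer."""
    buf = BytesIO()
    savemat(buf, {"arr": arr}, do_compression=compress)
    return buf.getbuffer().nbytes


size_plain = mat_size(False)
size_compressed = mat_size(True)
print("uncompressed:", size_plain)
print("compressed:  ", size_compressed)

# The compressed file still round-trips to the same values.
buf = BytesIO()
savemat(buf, {"arr": arr}, do_compression=True)
buf.seek(0)
restored = loadmat(buf)["arr"]
print("round trip ok:", np.array_equal(restored, arr))
```

Random doubles don't compress much, so expect a saving of only a few percent here, similar to the gap in the question. Arrays with more regular content should shrink a lot more.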

Daniel answered Sep 19 '22 08:09



There's a catch when you set do_compression=True: MATLAB may fail to load large files that were saved that way.

In my case, MATLAB (2017b) loaded mat files under 2 GB without any problem whether do_compression was True or False. But when I tried to load a 2.25 GB mat file saved using scipy.io.savemat() with compression, MATLAB failed, even though I could load the same file from Python using loadmat().

According to the scipy.io.savemat manual, the default is format='5', which covers MATLAB 5 up to 7.2. That is the latest format savemat supports. MATLAB's save() documentation, however, says that files over 2 GB need to be saved with '-v7.3'. I think scipy's savemat fails to save such files correctly because it doesn't support the MATLAB 7.3 format, which is required for mat files larger than 2 GB.
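If you want to catch this before writing, one option is to check the raw size of each variable first. This is only a sketch: `oversized_variables` is a hypothetical helper of mine, not a scipy function, and the 2**31 byte threshold is an assumption based on the 2 GB figure above. It also looks at uncompressed array sizes, not at the final file size.

```python
import numpy as np

TWO_GB = 2**31  # assumed limit, taken from the 2 GB figure mentioned above


def oversized_variables(mdict, limit_bytes=TWO_GB):
    """Return the names of variables whose raw data exceeds limit_bytes."""
    return [
        name
        for name, value in mdict.items()
        if not name.startswith("_")  # skip loadmat metadata such as __header__
        and np.asarray(value).nbytes > limit_bytes
    ]


# Small demo with an artificially low limit so it runs instantly.
data = {"small": np.zeros(10), "big": np.zeros(1000)}
print(oversized_variables(data, limit_bytes=4000))
```

If the list is non-empty, you could split the data across several files or use another format instead of calling savemat on it.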

Hopefully scipy will have an upgrade to fix this problem.

dbdq answered Sep 20 '22 08:09
