 

When writing a large array directly to disk in MATLAB, is there any need to preallocate?

I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk.
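
For concreteness, here is a minimal sketch of the kind of incremental write I have in mind (the file name, variable name and sizes are just placeholders):

    m = matfile('bigdata.mat', 'Writable', true);   % matfile creates a -v7.3 MAT-file
    nRows = 1e6;                                    % placeholder column length
    for k = 1:100
        m.A(1:nRows, k) = rand(nRows, 1);           % each assignment grows A on disk
    end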

Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than RAM.

Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?

(Assume that the whole file will be written in one session, so the risk of serious file fragmentation is low.)

asked Oct 01 '14 by Flyto


2 Answers

Q: Will the same performance hit from growing the array apply, and if so will it be significant when compared to the time it takes to write to disk anyway?

A: Yes, performance will suffer if you significantly grow a file on disk without pre-allocating. The performance hit will be a consequence of fragmentation. As you mentioned, fragmentation is less of a risk if the file is written in one session, but will cause problems if the file grows significantly.

A related question was raised on the MathWorks website, and the accepted answer was to pre-allocate when possible.

If you don't pre-allocate, then the extent of your performance problems will depend on:

  • your filesystem (how data are stored on disk, the cluster-size),
  • your hardware (HDD seek time, or SSD access times),
  • the size of your mat file (whether it moves into non-contiguous space),
  • and the current state of your storage (existing fragmentation / free space).

Let's pretend that you're running a recent Windows OS, and so are using the NTFS file system. Let's further assume that it has been set up with the default 4 kB cluster size. So space on disk gets allocated in 4 kB chunks, and the locations of these clusters are indexed in the Master File Table (MFT). If the file grows and contiguous space is not available, then there are only two choices:

  1. Re-write the entire file to a new part of the disk, where there is sufficient free space.
  2. Fragment the file, storing the additional data at a different physical location on disk.

The file system chooses to do the least-bad option, #2, and updates the MFT record to indicate where the new clusters will be on disk.

[Illustration of a fragmented file on NTFS, from WindowsITPro]

Now the hard disk needs to physically move the read head in order to read or write the new clusters, and this is a (relatively) slow process. Between moving the head and waiting for the right area of disk to spin underneath it, you're likely to be looking at a seek time of about 10 ms. So every time you hit a fragment there is an additional ~10 ms delay while the HDD moves to the new data; a file scattered into 1,000 fragments would add roughly ten seconds of pure seek time to reading it end to end. SSDs have much shorter access times (no moving parts). For the sake of simplicity, we're ignoring multi-platter systems and RAID arrays!

If you keep growing the file at different times, then you may experience a lot of fragmentation. This really depends on when and by how much the file grows, and on how else you are using the hard disk. The performance hit you experience will also depend on how often you read the file, and how frequently you encounter the fragments.

MATLAB stores data in column-major order, and from the comments it seems that you're interested in performing column-wise operations (sums, averages) on the dataset. If the columns become non-contiguous on disk, then you're going to hit lots of fragments on every operation!
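
As a hedged illustration (the file and variable names are assumed), column-wise statistics through matfile might look like this - each pass reads one column, which is a single contiguous run on disk only if the file is unfragmented:

    m = matfile('bigdata.mat');                 % handle to the on-disk data
    [nRows, nCols] = size(m, 'A');              % query dimensions without loading A
    colMeans = zeros(1, nCols);
    for j = 1:nCols
        colMeans(j) = mean(m.A(1:nRows, j));    % one column per read
    end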

As mentioned in the comments, both read and write actions will be performed via a buffer. As @user3666197 points out, the OS can also speculatively read ahead of the current position on disk, on the basis that you're likely to want that data next. This behaviour is especially useful if the hard disk would otherwise sit idle at times - keeping it operating at maximum capacity and working with small parts of the data in buffer memory can greatly improve read and write performance. However, from your question it sounds as though you want to perform large operations on a huge (too big for memory) .mat file. Given your use-case, the hard disk is going to be working at capacity anyway, and the data file is too big to fit in the buffer - so these particular tricks won't solve your problem.

So ... yes, you should pre-allocate. Yes, a performance hit from growing the array on disk will apply. Yes, it will probably be significant (it depends on specifics like the amount of growth, fragmentation, etc.). And if you're going to really get into the HPC spirit of things, then stop what you're doing, throw away MATLAB, shard your data and try something like Apache Spark! But that's another story.
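
To make that concrete, here is a minimal, hedged sketch of pre-allocating with matfile (the dimensions and names are assumptions to adapt). Assigning to the last element first makes MATLAB reserve the full array in the file before any data is written:

    m = matfile('bigdata.mat', 'Writable', true);
    nRows = 1e6; nCols = 100;            % assumed final size
    m.A(nRows, nCols) = 0;               % touching the last element allocates the whole array on disk
    for j = 1:nCols
        m.A(:, j) = rand(nRows, 1);      % fill in place; the file no longer needs to grow
    end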

Does that answer your question?

P.S. Corrections / amendments welcome! I was brought up on POSIX inodes, so sincere apologies if there are any inaccuracies in here...

answered by GnomeDePlume

Preallocating a variable in RAM and preallocating on the disk don't solve the same problem.

In RAM

To expand a matrix in RAM, MATLAB creates a new matrix with the new size, copies the values of the old matrix into the new one, and deletes the old one. This copying costs a lot of time, and the cost is paid on every expansion.

If you preallocate the matrix, its size never changes, so there is no reason for MATLAB to do this matrix copying at all.
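
A minimal illustration of the difference (the loop body is arbitrary):

    % Grown inside the loop: each new element can force a full reallocation and copy
    a = [];
    for k = 1:100000
        a(k) = k^2;
    end

    % Preallocated: the array is created once, at its final size
    b = zeros(1, 100000);
    for k = 1:100000
        b(k) = k^2;
    end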

On the hard-disk

The problem on the hard-disk is fragmentation as GnomeDePlume said. Fragmentation will still be a problem, even if the file is written in one session.

Here is why: the hard disk will generally already be a little fragmented. Imagine

  • # to be memory blocks on the hard disk that are full
  • M to be memory blocks on the hard disk that will be used to save data of your matrix
  • - to be free memory blocks on the hard disk

Now the hard disk could look like this before you write the matrix onto it:

###--##----#--#---#--------------------##-#---------#---#----#------

When you write parts of the matrix (e.g. blocks of MMM), you could imagine the process looking like this (in this example the file system simply scans from left to right and uses the first free space that is big enough - real file systems behave differently):

  1. First matrix part:
    ###--##MMM-#--#---#--------------------##-#---------#---#----#------
  2. Second matrix part:
    ###--##MMM-#--#MMM#--------------------##-#---------#---#----#------
  3. Third matrix part:
    ###--##MMM-#--#MMM#MMM-----------------##-#---------#---#----#------
  4. And so on ...

Clearly the matrix file on the hard disk is fragmented, even though we wrote it in one session without doing anything else in the meantime.

This would have gone better if the matrix file had been preallocated. In other words, we tell the file system in advance how big our file will be, or in this example, how many memory blocks we want to reserve for it.

Imagine the matrix needed 12 blocks: MMMMMMMMMMMM. We tell the file system how much we need by preallocating, and it will try to accommodate our needs as best it can. In this example, we are lucky: there is a free stretch of >= 12 memory blocks.

  1. Preallocating (We need 12 memory blocks):
    ###--##----#--#---# (------------) --------##-#---------#---#----#------
    The file system reserves the space between the parentheses for our matrix and will write into there.
  2. First matrix part:
    ###--##----#--#---# (MMM---------) --------##-#---------#---#----#------
  3. Second matrix part:
    ###--##----#--#---# (MMMMMM------) --------##-#---------#---#----#------
  4. Third matrix part:
    ###--##----#--#---# (MMMMMMMMM---) --------##-#---------#---#----#------
  5. Fourth and last part of the matrix:
    ###--##----#--#---# (MMMMMMMMMMMM) --------##-#---------#---#----#------

Voilà, no fragmentation!
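
In MATLAB terms, a hedged sketch of how you could compare the two strategies yourself (the file names, dimensions and last-element preallocation trick are assumptions, not a definitive benchmark):

    nRows = 10000; nCols = 500;             % assumed matrix size

    tic                                     % variant 1: grown column by column
    g = matfile('grown.mat', 'Writable', true);
    for j = 1:nCols
        g.A(1:nRows, j) = rand(nRows, 1);   % the on-disk variable grows each pass
    end
    toc

    tic                                     % variant 2: preallocated up front
    p = matfile('prealloc.mat', 'Writable', true);
    p.A(nRows, nCols) = 0;                  % reserve the full array in the file
    for j = 1:nCols
        p.A(:, j) = rand(nRows, 1);         % fill in place
    end
    toc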


Analogy

Generally, you can imagine this process as buying cinema tickets for a large group. The group would like to sit together, but some seats in the theatre are already reserved by other people. For the cashier to be able to accommodate the request (a large group that wants to sit together), he or she needs to know in advance how big the group is (preallocating).

answered by YinglaiYang