I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile
function, which allows random access to a .mat file on disk.
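For concreteness, the kind of incremental writing I have in mind looks roughly like this (the file name, variable name and sizes are just placeholders):

m = matfile('bigdata.mat', 'Writable', true);  % open (or create) a v7.3 MAT-file for writing
nRows = 10000;                                 % rows per column of data
for k = 1:1000                                 % many more columns than would fit in RAM at once
    chunk = rand(nRows, 1);                    % stand-in for one column of real data
    m.A(1:nRows, k) = chunk;                   % write the column straight to disk; A grows by one column
end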
Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than RAM.
Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?
(Assume that the whole file will be written in one session, so the risk of serious file fragmentation is low.)
Q: Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?
A: Yes, performance will suffer if you significantly grow a file on disk without pre-allocating. The performance hit will be a consequence of fragmentation. As you mentioned, fragmentation is less of a risk if the file is written in one session, but will cause problems if the file grows significantly.
A related question was raised on the MathWorks website, and the accepted answer was to pre-allocate when possible.
If you don't pre-allocate, then the extent of your performance problems will depend on your file system, your hardware (HDD vs. SSD), how much and how often the file grows, and your access pattern.
Let's pretend that you're running a recent Windows OS, and so are using the NTFS file system. Let's further assume that it has been set up with the default 4 kB cluster size. So, space on disk gets allocated in 4 kB chunks, and the locations of these are indexed in the Master File Table. If the file grows and contiguous space is not available, then there are only two choices:

1. Rewrite the entire file to a new part of the disk where there is enough contiguous free space, or
2. Fragment the file, storing the new data at a different physical location on disk from the existing data.
The file system chooses to do the least-bad option, #2, and updates the MFT record to indicate where the new clusters will be on disk.
Now, the hard disk needs to physically move the read head in order to read or write the new clusters, and this is a (relatively) slow process. In terms of moving the head and waiting for the right area of disk to spin underneath it, you're likely to be looking at a seek time of about 10 ms. So every time you hit a fragment there will be an additional ~10 ms delay whilst the HDD moves to access the new data. SSDs have much shorter seek times (no moving parts). For the sake of simplicity, we're ignoring multi-platter systems and RAID arrays!
If you keep growing the file at different times, then you may experience a lot of fragmentation. This really depends on how often and by how much the file grows, and how else you are using the hard disk. The performance hit you experience will also depend on how often you are reading the file, and how frequently you encounter the fragments.
MATLAB stores data in column-major order, and from the comments it seems that you're interested in performing column-wise operations (sums, averages) on the dataset. If the columns become non-contiguous on disk, then you're going to hit lots of fragments on every operation!
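To make that concrete, a column-wise pass over a matfile variable might look something like this (the file and variable names are hypothetical); every iteration pulls one full column from disk, so whether those columns are contiguous matters a great deal:

m = matfile('bigdata.mat');              % open the file read-only
[nRows, nCols] = size(m, 'A');           % query the size without loading the data
colMeans = zeros(1, nCols);              % small result vector, easily fits in RAM
for k = 1:nCols
    colMeans(k) = mean(m.A(1:nRows, k)); % each read fetches one full column from disk
end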
As mentioned in the comments, both read and write actions will be performed via a buffer. As @user3666197 points out, the OS can speculatively read ahead of the current data on disk, on the basis that you're likely to want that data next. This behaviour is especially useful if the hard disk would otherwise be sitting idle at times: keeping it operating at maximum capacity and working with small parts of the data in buffer memory can greatly improve read and write performance. However, from your question it sounds as though you want to perform large operations on a huge (too big for memory) .mat file. Given your use case, the hard disk is going to be working at capacity anyway, and the data file is too big to fit in the buffer, so these particular tricks won't solve your problem.
So... yes, you should pre-allocate. Yes, a performance hit from growing the array on disk will apply. Yes, it will probably be significant (it depends on specifics like the amount of growth, fragmentation, etc.). And if you're going to really get into the HPC spirit of things, then stop what you're doing, throw away MATLAB, shard your data and try something like Apache Spark! But that's another story.
Does that answer your question?
P.S. Corrections / amendments welcome! I was brought up on POSIX inodes, so sincere apologies if there are any inaccuracies in here...
Preallocating a variable in RAM and preallocating on the disk don't solve the same problem.
To expand a matrix in RAM, MATLAB creates a new matrix with the new size, copies the values of the old matrix into the new one, and deletes the old one. This costs a lot of performance.
If you preallocate the matrix, its size does not change, so MATLAB has no reason to do this copying anymore.
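A toy in-RAM comparison of the two approaches (the sizes are arbitrary):

% Growing on every iteration: MATLAB may have to reallocate and copy x repeatedly
x = [];
for k = 1:1e6
    x(k) = k^2;        %#ok<AGROW>  the array is resized on each pass
end

% Preallocated: the memory is claimed once and then only filled in
y = zeros(1, 1e6);
for k = 1:1e6
    y(k) = k^2;
end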
The problem on the hard disk is fragmentation, as GnomeDePlume said. Fragmentation will still be a problem, even if the file is written in one session.
Here is why: the hard disk will generally be a little fragmented already. Imagine

#  to be memory blocks on the hard disk that are full
M  to be memory blocks on the hard disk that will be used to save the data of your matrix
-  to be free memory blocks on the hard disk

Now the hard disk could look like this before you write the matrix onto it:
###--##----#--#---#--------------------##-#---------#---#----#------
When you write parts of the matrix (e.g. MMM blocks), you could imagine the process to look like this (in this example the file system simply goes from left to right and uses the first free space that is big enough; real file systems are different):
###--##MMM-#--#---#--------------------##-#---------#---#----#------
###--##MMM-#--#MMM#--------------------##-#---------#---#----#------
###--##MMM-#--#MMM#MMM-----------------##-#---------#---#----#------
Clearly, the matrix file on the hard disk is fragmented, although we wrote it without doing anything else in the meantime.
This would be better if the matrix file were preallocated. In other words, we tell the file system how big our file will be, or, in this example, how many memory blocks we want to reserve for it.
Imagine the matrix needs 12 blocks: MMMMMMMMMMMM. By preallocating, we tell the file system how much space we need, and it will try to accommodate us as best as it can. In this example we are lucky: there is a free region of >= 12 contiguous memory blocks.
###--##----#--#---# (------------) --------##-#---------#---#----#------
###--##----#--#---# (MMM---------) --------##-#---------#---#----#------
###--##----#--#---# (MMMMMM------) --------##-#---------#---#----#------
###--##----#--#---# (MMMMMMMMM---) --------##-#---------#---#----#------
###--##----#--#---# (MMMMMMMMMMMM) --------##-#---------#---#----#------
Voilà, no fragmentation!
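With matfile, one way to reserve the space up front (a sketch, assuming the final size is known in advance; the names and sizes are placeholders) is to write to the last element first, which grows the variable on disk to its full size before the loop starts:

m = matfile('bigdata.mat', 'Writable', true);
nRows = 10000; nCols = 1000;              % final size, assumed known in advance
m.A(nRows, nCols) = 0;                    % touching the last element allocates the whole array on disk
for k = 1:nCols
    m.A(1:nRows, k) = rand(nRows, 1);     % each column write now lands in already-reserved space
end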
Generally, you can imagine this process as buying cinema tickets for a large group. You would like to sit together as a group, but some seats in the theatre are already reserved by other people. For the cashier to be able to accommodate your request (a large group that wants to sit together), he or she needs to know how big your group is (preallocating).