I have files (>100) that each contain recorded sets of data like this:
All of the data together may exceed 20 GB, so loading all of it into memory is not an option. Hence, I would like to create memory-mapped files for each of the files, while hiding the complexity of the underlying data from the "user". For example, I would like to be able to operate on the data like this:
for i = 1:totalNumberOfRecordings
    recording(i) = recording(i) * 10;  % some arbitrary data operation
    % or, even better:
    recording(i).relatedData = 2000;
end
So, no matter whether recording(i) is in file0, file1, or some other file, and no matter its position within that file, I want a list that allows me to access the related data via a memory map.
What I have so far is a list of all files within a certain directory. My idea was to simply create a list like this:
entry1: [memoryMappedFileHandle, dataRangeOfRecording]
entry2: [memoryMappedFileHandle, dataRangeOfRecording]
And then use this list to further abstract files and recordings. I started with this code:
fileList = getAllFiles(directoryName);
list = [];
n = 0;
for file = 1:length(fileList)
    m = memmapfile(fileList{file});
    for trace = 1:numberOfTracesInFile
        n = n + 1;
        list = [list; [n, m]];
    end
end
But I get the error:
Memmapfile objects cannot be concatenated
I'm quite new to MATLAB, so this is probably a bad idea after all. How can I do it better? Is it possible to create a memory-mapped table that spans multiple files?
I'm not sure whether the core of your question is specifically about memory-mapped files, or more generally about whether there is a way to seamlessly process data from multiple large files without the user needing to know which file the data lives in.
To address the second question: MATLAB R2014b introduced a new datastore object that is designed to do pretty much this. Essentially, you create a datastore object that refers to your files, and you can then pull data from the datastore without needing to worry about which file it's in. datastore is also designed to work very closely with the new mapreduce functionality that was introduced at the same time, which lets you easily parallelize map-reduce programming patterns and even tie in with Hadoop.
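As a minimal sketch of that approach (assuming the recordings are stored as delimited text files; the file extension and the processing step are placeholders for whatever your data actually needs):

```matlab
% Create a datastore over every matching file in the directory.
% datastore (R2014b+) auto-detects tabular text files.
ds = datastore(fullfile(directoryName, '*.txt'));

% Read the data in memory-sized chunks; file boundaries are invisible here.
while hasdata(ds)
    chunk = read(ds);   % a table holding the next block of data
    % ... operate on chunk ...
end
```

The point is that the loop body never mentions file names: the datastore decides how much to hand you per read and moves across files on its own.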
To answer the first question - I'm afraid you've already found your answer: memmapfile objects cannot be concatenated, so no, it's not straightforward. I think your best approach would be to build your own class, which would hold multiple memmapfile objects in a cell array along with information about which data is in which file, plus some sort of getData method that retrieves the appropriate data from the appropriate file. (This would essentially be writing your own datastore class, but one that works with memory-mapped files rather than ordinary files, so you might be able to copy much of the design and/or implementation from datastore itself.)
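A rough sketch of what such a class could look like, assuming fixed-length recordings of doubles laid out back-to-back in each file (the class name, property names, and layout are all illustrative, not a real MATLAB API):

```matlab
classdef RecordingStore < handle
    % Hypothetical wrapper that hides which memory-mapped file a
    % given recording lives in.
    properties
        maps   % cell array of memmapfile objects, one per file
        index  % index(i,:) = [fileIdx, firstSample, lastSample] for recording i
    end
    methods
        function obj = RecordingStore(fileList, recsPerFile, recLength)
            obj.maps = cell(numel(fileList), 1);
            n = 0;
            for f = 1:numel(fileList)
                obj.maps{f} = memmapfile(fileList{f}, 'Format', 'double');
                for r = 1:recsPerFile(f)
                    n = n + 1;
                    first = (r - 1) * recLength + 1;
                    obj.index(n, :) = [f, first, first + recLength - 1];
                end
            end
        end
        function data = getData(obj, i)
            % Fetch recording i from whichever file contains it.
            e = obj.index(i, :);
            data = obj.maps{e(1)}.Data(e(2):e(3));
        end
    end
end
```

Storing the memmapfile objects in a cell array sidesteps the concatenation error from your question, and the index array gives you the recording-number-to-file mapping you described.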