I am writing a MATLAB program that reads about 500 files. Each file has about 20,000 lines, with one number on each line. The program builds a 20,000 x 500 matrix from these numbers. The numbers are stored as doubles, so 8 bytes per number. I would therefore expect the matrix to take 20,000 * 500 * 8 bytes = 8E7 bytes, i.e. a little under 100 MB. And yet this program exhausts my 16 GB of memory. As the program runs, I see the memory use going up steadily, GB by GB. I am using MATLAB R2015b on Ubuntu 14.04.
What's happening? Many thanks for your attention.
Here is the full code
clear all;
% number of rna bits in the file
filesize = 20532
maxFiles = 480;
rnaCounts = NaN(filesize,maxFiles);
myFolder = '~/_STATS/data3/RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3';
filePattern = fullfile(myFolder, '*genes.normalized_results');
theFiles = dir(filePattern);
rnaCounts = NaN(filesize,length(theFiles));
for k = 1 : length(theFiles)
    mrnaFilename = strtrim(theFiles(k).name);
    fprintf(1, 'Now reading mrnaFile %d %s \n', k, mrnaFilename);
    % read rna file
    fullFileName = fullfile(myFolder, mrnaFilename);
    rnafid = fopen(fullFileName);
    if rnafid < 0
        fprintf('====ERROR OPENING RNA FILE =====================');
    end
    rnaline = fgets(rnafid);
    lc = 1; % line counter
    while ischar(rnaline) && feof(rnafid) ~= 1
        rnaline = fgets(rnafid);
        rnaSplit = strsplit(rnaline);
        % write to the matrix
        rnaCounts(lc,k) = str2num(rnaSplit{2});
        lc = lc + 1;
    end
    fclose(rnafid);
end
It depends on the size of your data. In MATLAB, each double-precision number takes 8 bytes of memory, so a vector of 10 numbers needs 80 bytes. With this rule of thumb you can estimate how much memory your data needs.
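As a rough illustration of that rule of thumb, the sketch below estimates the size of the matrix from the question and compares it with what MATLAB reports through whos. The dimensions 20532 and 480 are taken from the posted code; the variable names are only illustrative.

```matlab
% Estimate the memory needed for a full matrix of doubles.
nRows = 20532;          % lines per file, from the question's code
nCols = 480;            % number of files, from the question's code
bytesPerDouble = 8;

expectedBytes = nRows * nCols * bytesPerDouble;
fprintf('Expected size: %.1f MB\n', expectedBytes / 2^20);

% Compare with the size MATLAB reports for the preallocated matrix.
rnaCounts = NaN(nRows, nCols);
info = whos('rnaCounts');
fprintf('Reported size: %.1f MB\n', info.bytes / 2^20);
```

If the two figures agree and memory still grows by gigabytes while the loop runs, the matrix itself is unlikely to be the cause.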
If the high memory usage comes from running several programs at the same time, closing the ones you do not need frees memory. Likewise, if a single program occupies too much memory, ending that program resolves the problem. Failing that, increase the physical memory.
It is safer to have more memory than the estimate suggests, because MATLAB is usually not the only process running. If you run MATLAB and Chrome simultaneously and execute memory-intensive code in MATLAB, you can expect to run out of memory fairly quickly. I would suggest 8 GB of RAM to be safe, and a minimum of 4 GB.
Sometimes high memory usage is caused by a memory leak, which is a software defect. Memory leaks matter most for programs that run for a long time, such as those on servers.
As verified by the OP, the str2num function in the Linux version of Matlab 2015b has a memory leak. This function is not very useful anyway, as it is designed to parse strings representing entire matrices (1 2; 3 4) rather than the typical use case of parsing a single number (1.234). Use str2double when doing simple number parsing; it is faster even when str2num isn't broken.
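A minimal sketch of the suggested swap, applied to one line in the format shown in the question. The sample line is built by hand here, with a tab assumed between the identifier and the number.

```matlab
% One line as fgets would return it: identifier, tab, value, newline.
rnaline = sprintf('A2LD1|87769\t135.5735\n');

% strtrim removes the trailing newline so strsplit yields exactly two tokens.
rnaSplit = strsplit(strtrim(rnaline));

% str2double parses a single scalar and returns NaN if the text is not a number.
value = str2double(rnaSplit{2});
bad = str2double('not a number');

fprintf('value = %.4f, bad = %g\n', value, bad);
```

In the original loop this amounts to replacing str2num(rnaSplit{2}) with str2double(rnaSplit{2}). The NaN return for unparsable text also fits the NaN-preallocated matrix, since a bad line simply leaves a NaN entry.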
It is likely that using a different version of Matlab would also work around the problem, because in my experience, these kinds of memory bugs don't usually persist from one version to the next.
Often, high-level I/O functions such as dlmread or textscan are useful for reading such text formats. Use dlmread if you have only numeric data, and textscan for more complex formats.
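For the textscan route, a sketch along the following lines may help. It assumes each file has one header line followed by tab-separated lines holding a text identifier and a number. The temporary file, its header names and the second gene row are made up for the demonstration.

```matlab
% Write a small hypothetical file: one header line, then id<TAB>value lines.
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, 'gene_id\tnormalized_count\n');
fprintf(fid, 'A2LD1|87769\t135.5735\n');
fprintf(fid, 'A2M|2\t10.5\n');
fclose(fid);

% Read a string column and a numeric column, skipping the header line.
fid = fopen(tmpFile);
C = textscan(fid, '%s %f', 'Delimiter', '\t', 'HeaderLines', 1);
fclose(fid);
delete(tmpFile);

ids = C{1};      % cell array of identifiers
values = C{2};   % column vector of doubles
```

Unlike dlmread, this keeps the identifiers, which could be useful for checking that every file lists the genes in the same order.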
The sample data you provided is:
A2LD1|87769 135.5735
As you only need the number in the second column and discard the identifier in the first column, all you have is numeric data, and you can use dlmread.
data = dlmread(fullFileName, '\t', 1, 1);
The \t specifies that the delimiter (column separator) is a tab. The two 1s specify a row offset and a column offset, i.e. ignore the first row (the header) and the first column (id) of the file.
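Putting this together, the inner while loop of the original program could be replaced by one dlmread call per file. The sketch below is only an outline under stated assumptions: the demo folder, the file names, the header text and the gene values are invented so the example is self-contained, and every file is assumed to contain the same number of data rows.

```matlab
% Build a hypothetical demo folder with two tiny files in the assumed format.
demoFolder = tempname;
mkdir(demoFolder);
for k = 1:2
    fid = fopen(fullfile(demoFolder, sprintf('s%d.genes.normalized_results', k)), 'w');
    fprintf(fid, 'gene_id\tnormalized_count\n');
    fprintf(fid, 'A2LD1|87769\t%.4f\n', 135.5735 * k);
    fprintf(fid, 'A2M|2\t%.4f\n', 10 * k);
    fclose(fid);
end

% Same structure as the original program, without the line-by-line parsing.
theFiles = dir(fullfile(demoFolder, '*genes.normalized_results'));
nGenes = 2;                                   % 20532 in the real data
rnaCounts = NaN(nGenes, numel(theFiles));
for k = 1:numel(theFiles)
    fullFileName = fullfile(demoFolder, theFiles(k).name);
    % Skip the header row and the id column, as in the call above.
    rnaCounts(:, k) = dlmread(fullFileName, '\t', 1, 1);
end
disp(rnaCounts);
```

Because each file is read in a single call, no per-line string parsing happens in MATLAB code at all, so this avoids str2num entirely.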