I need to process large amounts of tabular data of mixed type - strings and doubles. A standard problem, I would think. What is the best data structure in Matlab for working with this?
A cell array is definitely not the answer: it is extremely memory-inefficient (tests shown below). dataset (from the Statistics Toolbox) is horribly time- and space-inefficient. That leaves a struct array or a struct of arrays. I ran tests across all four options for both time and memory (below), and it seems to me the struct of arrays is the best option for the things I tested for.
I am relatively new to Matlab, and this is a bit disappointing, frankly. Anyway, I'm looking for advice on whether I am missing something, and on whether my tests are accurate/reasonable. Are there other considerations besides access speed, conversion cost, and memory usage that are likely to come up as I write more code using this stuff? (FYI, I am using R2010b.)
** Test #1: Access speed (accessing a single data item)
cellarray:0.002s
dataset:36.665s %<<< This is horrible
structarray:0.001s
struct of array:0.000s
** Test #2: Conversion speed and memory usage (dataset was dropped from this test)
Cellarray(doubles)->matrix:d->m: 0.865s
Cellarray(mixed)->structarray:c->sc: 0.268s
Cellarray(doubles)->structarray:d->sd: 0.430s
Cellarray(mixed)->struct of arrays:c->sac: 0.361s
Cellarray(doubles)->struct of arrays:d->sad: 0.887s
Name Size Bytes Class Attributes
c 100000x10 68000000 cell
d 100000x10 68000000 cell
m 100000x10 8000000 double
sac 1x1 38001240 struct
sad 1x1 8001240 struct
sc 100000x1 68000640 struct
sd 100000x1 68000640 struct
================== CODE: TEST#1
%% cellarray
c = cell(100000,10);
c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
te = tic;
for iii=1:1000
x = c(1234,5);
end
te = toc(te);
fprintf('cellarray:%0.3fs\n', te);
%% dataset
ds = dataset( { c, cols{:} } );
te = tic;
for iii=1:1000
x = ds(1234,5);
end
te = toc(te);
fprintf('dataset:%0.3fs\n', te);
%% structarray
s = cell2struct( c, cols, 2 );
te = tic;
for iii=1:1000
x = s(1234).Var5;
end
te = toc(te);
fprintf('structarray:%0.3fs\n', te);
%% struct of arrays
for iii=1:numel(cols)
if iii/2==floor(iii/2) % even => string
sac.(cols{iii}) = c(:,iii);
else
sac.(cols{iii}) = cell2mat(c(:,iii));
end
end
te = tic;
for iii=1:1000
x = sac.Var5(1234);
end
te = toc(te);
fprintf('struct of array:%0.3fs\n', te);
================== CODE: TEST #2
%% cellarray
% c - cellarray containing mixed type
c = cell(100000,10);
c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
% c - cellarray containing doubles only
d = num2cell( zeros( 100000, 10 ) );
%% matrix
% doubles only
te = tic;
m = cell2mat(d);
te = toc(te);
fprintf('Cellarray(doubles)->matrix:d->m: %0.3fs\n', te);
%% structarray
% mixed
te = tic;
sc = cell2struct( c, cols, 2 );
te = toc(te);
fprintf('Cellarray(mixed)->structarray:c->sc: %0.3fs\n', te);
% doubles
te = tic;
sd = cell2struct( d, cols, 2 );
te = toc(te);
fprintf('Cellarray(doubles)->structarray:d->sd: %0.3fs\n', te);
%% struct of arrays
% mixed
te = tic;
for iii=1:numel(cols)
if iii/2==floor(iii/2) % even => string
sac.(cols{iii}) = c(:,iii);
else
sac.(cols{iii}) = cell2mat(c(:,iii));
end
end
te = toc(te);
fprintf('Cellarray(mixed)->struct of arrays:c->sac: %0.3fs\n', te);
% doubles
te = tic;
for iii=1:numel(cols)
sad.(cols{iii}) = cell2mat(d(:,iii));
end
te = toc(te);
fprintf('Cellarray(doubles)->struct of arrays:d->sad: %0.3fs\n', te);
%%
clear iii cols te;
whos
The way to make Matlab code space- and time-efficient is to work with large arrays of primitives - that is, arrays of doubles, ints, or chars. This gives you a tighter layout in memory and lets you do vectorized operations.
For tabular data, each column is going to be homogeneous in type, but different columns may be of different types, and you'll typically have many more rows than columns. You'll also often be doing operations - comparisons or math - over all elements of a column, or a masked selection of a column, which lends itself to vectorized operations. So store each column as a column array, ideally of primitives. You can stick those columns either in fields of a struct or in elements of a cell vector; it doesn't matter much performance-wise, and the struct form will be much more readable and look more like a table. A 2-D cell array, or any other data structure that breaks all the elements up into their own small mxArrays, is not going to perform acceptably.
That is, if you have a 10,000 row by 10 column table, you want to have a 10-long cell array or 10-field struct, with each of those fields or elements holding a 10,000-long primitive column vector.
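For instance, a minimal sketch of that layout (the field names and values here are made up purely for illustration):
% A 10,000-row table held as one scalar struct of column vectors.
nRows = 10000;
t.Price  = rand(nRows, 1);                      % numeric column: one contiguous double array
t.Ticker = cellstr(repmat('ABCD', nRows, 1));   % string column: a cellstr column vector
% Whole-column, vectorized operations then work directly on the fields:
mask     = t.Price > 0.5;                       % logical mask over all rows
avgPrice = mean(t.Price(mask));                 % math on a masked selection, no looping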
The dataset object is basically a wrapper around a struct of column arrays like the one described above, stuck in an object. But objects in Matlab have greater overhead than regular structs and cells; you're paying for one or more method calls on every access to it. Have a look at "Is MATLAB OOP slow or am I doing something wrong?" (full disclosure: that's one of my answers).
The test you have set up isn't indicative of how good Matlab code will perform, because it's doing scalar single-element access. That is, it's paying for column and then row element access on every pass through the loop. If your Matlab code is doing that, you're already out of luck. To be fast, you need to pop columns out outside of a loop - that is, hoist the expensive column access operation to an outer loop or setup code - and then either do vectorized operations (like +, ==, <, ismember, and so on) on entire column vectors, or loop over primitive numeric vectors (which the JIT can optimize). If you do this, then dataset and other object-based tabular structures can have decent performance.
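As a rough illustration of that difference, using the sac struct of arrays built in the test code above:
% Slow pattern: struct field access plus scalar indexing on every pass.
total = 0;
for i = 1:numel(sac.Var1)
    total = total + sac.Var1(i);   % pays for the field lookup and element access each iteration
end
% Faster pattern: hoist the column out once, then work on the whole vector.
v     = sac.Var1;                  % single field access
total = sum(v);                    % vectorized reduction over the entire column
mask  = (v == 0);                  % vectorized comparison producing a logical mask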
Strings in Matlab kind of suck, unfortunately. You want to get away from cellstrs. You have a couple of options.
- Convert a string column to a single char array. This is a single contiguous array in memory and is more space-efficient than a cell array, and may be faster for comparison operations and so on. It's also one of Matlab's native string representations, so normal string functions will work with it.
- If a string column has low cardinality (few distinct values), encode it as numeric identifiers; the unique and ismember functions can help implement these encodings (see the sketch after this list). As long as you're just doing equality tests and not sorting, these encoded string columns will operate at numeric speed.
- The dataset class has support for "classifier" or "categorical" variables, which are basically a ready-made implementation of that low-cardinality encoding.
- If you have to stick with cellstrs, store them as cellstr column vectors inside the structure as described above; that way you only have to pay the price of cell access when you're actually operating on a string column.
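A rough sketch of the low-cardinality encoding idea, assuming a cellstr column (the variable names here are hypothetical):
% Hypothetical string column with only a few distinct values.
ticker = repmat({'AAPL'; 'GOOG'; 'MSFT'}, 1000, 1);
% Build the encoding: map each distinct string to a numeric code.
[symbols, ~, codes] = unique(ticker);            % symbols is the dictionary; codes is one integer per row
% Equality tests now run at numeric speed on the code column.
isMsft = (codes == find(strcmp(symbols, 'MSFT')));
% ismember translates a set of strings into codes for membership tests.
wantedCodes = find(ismember(symbols, {'AAPL', 'GOOG'}));
isWanted    = ismember(codes, wantedCodes);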
I would say that if you need to manage large amounts of data, then MATLAB is not the best choice to begin with. Go for a proper database and then import just the data you need into MATLAB.
However, if you're planning to use MATLAB anyway, I would still choose cell arrays - that is, if you don't need syntactic references to your data in the form of field names, as you get with structures.
When using cellarrays, keep in mind that each cell imposes 112 bytes of overhead. Therefore, I would create a cell for each column (not a cell for each scalar double):
c = cell(1,10);
c(1,1:2:10) = num2cell(rand(1e5,5),1);                % odd columns: one 1e5-by-1 double vector per cell
c(1,2:2:10) = {cellstr(repmat('asdf', 100000, 1))};   % even columns: one 1e5-by-1 cellstr per cell (scalar expansion fills all five)
and memory-wise (no changes in time):
Name Size Bytes Class Attributes
c 1x10 38000600 cell
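A quick sketch of how you would then work with the columns in this layout (the variable names are just illustrative; the column indices follow the construction above):
prices = c{1};                       % first numeric column, a 1e5-by-1 double vector
names  = c{2};                       % first string column, a 1e5-by-1 cellstr
avgPrice = mean(prices);             % vectorized math on the whole column
isAsdf   = strcmp(names, 'asdf');    % vectorized string comparison giving a logical mask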
Also, what you call a struct of arrays is usually referred to as a 'scalar' structure, as opposed to a struct array (or non-scalar structure).
If I recall correctly, structs tend to degrade in performance for reads/writes when you start nesting fields (I need to find the specific thread, though).