I am looking for an optimized data type for an "observations-variables" table in MATLAB that can be accessed quickly and easily both by column (by variable) and by row (by observation).
Here is a comparison of the existing MATLAB data types:
I wonder whether there is a simpler, optimized alternative to the table data type that combines row-number indexing with variable-name column indexing, either for numeric variables only or for any variable type.
Results of test script:
----
TEST1 - reading individual observations
Matrix: 0.072519 sec
Table: 18.014 sec
Array of structures: 0.49896 sec
Structure of arrays: 4.3865 sec
----
TEST2 - reading individual variables
Matrix: 0.0047834 sec
Table: 0.0017972 sec
Array of structures: 2.2715 sec
Structure of arrays: 0.0010529 sec
Test script:
Nobs = 1e5; % number of observations (rows)
varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
Nvar = numel(varNames); % number of variables (columns)
M = randn(Nobs, Nvar); % matrix
T = array2table(M, 'VariableNames', varNames); % table
NS = struct; % nonscalar structure = array of structures
for i=1:Nobs
    for v=1:Nvar
        NS(i).(varNames{v}) = M(i,v);
    end
end
SS = struct; % scalar structure = structure of arrays
for v=1:Nvar
    SS.(varNames{v}) = M(:,v);
end
%% TEST 1 - reading individual observations (row-wise)
disp('----'); disp('TEST1 - reading individual observations');
tic; % matrix
for i=1:Nobs
    x = M(i,:);
end
disp(['Matrix: ', num2str(toc()), ' sec']);
tic; % table
for i=1:Nobs
    x = T(i,:);
end
disp(['Table: ', num2str(toc()), ' sec']);
tic; % nonscalar structure = array of structures
for i=1:Nobs
    x = NS(i);
end
disp(['Array of structures: ', num2str(toc()), ' sec']);
tic; % scalar structure = structure of arrays
% x is still a struct from the previous loop, so field assignment works
for i=1:Nobs
    for v=1:Nvar
        x.(varNames{v}) = SS.(varNames{v})(i);
    end
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);
%% TEST 2 - reading individual variables (column-wise)
disp('----'); disp('TEST2 - reading individual variables');
tic; % matrix
for v=1:Nvar
    x = M(:,v);
end
disp(['Matrix: ', num2str(toc()), ' sec']);
tic; % table
for v=1:Nvar
    x = T.(varNames{v});
end
disp(['Table: ', num2str(toc()), ' sec']);
tic; % nonscalar structure = array of structures
% x already holds an Nobs-by-1 vector from the previous loop, so it does not grow
for v=1:Nvar
    for i=1:Nobs
        x(i,1) = NS(i).(varNames{v});
    end
end
disp(['Array of structures: ', num2str(toc()), ' sec']);
tic; % scalar structure = structure of arrays
for v=1:Nvar
    x = SS.(varNames{v});
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);
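As a cross-check on the tic/toc loops above, a single access can also be timed with timeit, which runs the operation repeatedly and returns a typical time. This is only a sketch: the smaller sizes, the row index 5, and the variable names tMatrixRow and tTableRow are arbitrary choices for illustration, not part of the original test.

```matlab
% Hypothetical smaller data set, just to keep the sketch quick to run
Nobs = 1e4;
Nvar = 15;
M = randn(Nobs, Nvar);          % matrix
T = array2table(M);             % table with default variable names

% timeit takes a function handle and returns the measured time in seconds
tMatrixRow = timeit(@() M(5,:));    % read one row from the matrix
tTableRow  = timeit(@() T(5,:));    % read one row from the table

fprintf('Matrix row: %g sec, Table row: %g sec\n', tMatrixRow, tTableRow);
```

The absolute numbers depend on the machine and MATLAB release, so only the ratio between the two is worth comparing.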
I would use matrices, since they're the fastest and most straightforward to use, and then create a set of enumerated column labels to make indexing columns easier. Here are a few ways to do this:
containers.Map object: Given your variable names, and assuming they map in order to columns 1 through N, you can create a mapping like so:
varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
col = containers.Map(varNames, 1:numel(varNames));
And now you can use the map to access columns of your data by variable name. For example, if you want to fetch the columns for variables A and C (i.e. the first and third) from a matrix data, you would do this:
subData = data(:, [col('A') col('C')]);
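If the list of wanted variables is itself stored in a cell array, the map can be queried for all of them at once with values. A minimal sketch, assuming a small made-up matrix (magic(5)) in place of real data and a hypothetical list named wanted:

```matlab
varNames = {'A','B','C','D','E'};                  % hypothetical variable names
data = magic(5);                                   % hypothetical 5-by-5 data matrix
col = containers.Map(varNames, 1:numel(varNames)); % name -> column index

wanted = {'A','C','E'};                  % columns to fetch, by name
idx = cell2mat(values(col, wanted));     % look up all indices in one call
subData = data(:, idx);                  % all observations, selected variables
oneObs = data(2, idx);                   % a single observation (row 2)
```

Once idx is computed, everything after it is plain matrix indexing, so the lookup cost is paid only once per selection rather than once per access.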
struct: You can create a structure with the variable names as its fields and the corresponding column indices as their values like so:
enumData = [varNames; num2cell(1:numel(varNames))];
col = struct(enumData{:});
And here's what col contains:
struct with fields:
A: 1
B: 2
C: 3
D: 4
E: 5
F: 6
G: 7
H: 8
I: 9
J: 10
K: 11
L: 12
M: 13
N: 14
O: 15
And you would access columns A and C like so:
subData = data(:, [col.A col.C]);
% ...or with dynamic field names...
subData = data(:, [col.('A') col.('C')]);
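Dynamic field names are handy when the variable names arrive as strings at run time. Here is a small sketch under the same assumptions as before (a made-up magic(5) matrix and a hypothetical cell array wanted):

```matlab
varNames = {'A','B','C','D','E'};                    % hypothetical variable names
data = magic(5);                                     % hypothetical 5-by-5 data matrix
enumData = [varNames; num2cell(1:numel(varNames))];  % 2-by-N cell: names over indices
col = struct(enumData{:});                           % scalar struct, one field per name

wanted = {'B','D'};                                  % names chosen at run time
idx = cellfun(@(name) col.(name), wanted);           % dynamic field lookup per name
subData = data(:, idx);                              % all rows, columns B and D
oneObs = data(3, [col.A col.E]);                     % row 3, columns A and E
```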
You could just create a variable in your workspace for every column name and store the column indices in them. This will pollute your workspace with more variables, but gives you a terse way to access column data. Here's an easy way to do it, using the much-maligned eval:
enumData = [varNames; num2cell(1:numel(varNames))];
eval(sprintf('%s=%d;', enumData{:}));
And accessing columns A and C is as easy as:
subData = data(:, [A C]);
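Since eval executes whatever text it is given, it seems prudent to check the names first. The sketch below adds an isvarname check before the eval call; that check is my own addition rather than part of the approach above, and the three-column magic(3) matrix is made up for illustration.

```matlab
varNames = {'A','B','C'};          % hypothetical variable names
data = magic(3);                   % hypothetical 3-by-3 data matrix

% Added safeguard (an assumption, not in the original): refuse anything
% that is not a valid MATLAB identifier before handing it to eval
assert(all(cellfun(@isvarname, varNames)), 'Invalid variable name in varNames');

enumData = [varNames; num2cell(1:numel(varNames))];
eval(sprintf('%s=%d;', enumData{:}));   % runs the text 'A=1;B=2;C=3;'

subData = data(:, [A C]);          % columns 1 and 3
```

Note that this still overwrites any existing workspace variables with the same names.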
This is probably a good dose of overkill, but if you're going to use the same mapping of column labels and indices for many analyses you could create an enumeration class, save it somewhere on your MATLAB path, and never have to worry about defining your column enumerations again. For example, here's a ColVar class with 15 enumerated values:
classdef ColVar < double
    enumeration
        A (1)
        B (2)
        C (3)
        D (4)
        E (5)
        F (6)
        G (7)
        H (8)
        I (9)
        J (10)
        K (11)
        L (12)
        M (13)
        N (14)
        O (15)
    end
end
And you would access columns A and C like so:
subData = data(:, [ColVar.A ColVar.C]);