
Matlab Table / Dataset type optimization

I am looking for an optimized data type for an "observations-variables" table in MATLAB that can be accessed quickly and easily both by columns (through variables) and by rows (through observations).

Here is a comparison of the existing MATLAB data types:

  1. Matrix is very fast; however, it has no built-in labels/enumerations for indexing its dimensions, and you can't always remember which variable name belongs to which column index.
  2. Table has very poor performance, especially when reading individual rows/columns in a for loop (I suppose it runs some slow conversion methods and is designed to be more Excel-like).
  3. Scalar structure (structure of column arrays): fast column-wise access to variables as vectors, but slow row-wise conversion to observations.
  4. Nonscalar structure (array of structures): fast row-wise access to observations as vectors, but slow column-wise conversion to variables.

I wonder whether there is a simpler, optimized version of the Table data type for the case where I just want to combine row-number and column-variable indexing, either with numerical variables only or with any variable type.

Results of test script:

----
TEST1 - reading individual observations
Matrix: 0.072519 sec
Table: 18.014 sec
Array of structures: 0.49896 sec
Structure of arrays: 4.3865 sec
----
TEST2 - reading individual variables
Matrix: 0.0047834 sec
Table: 0.0017972 sec
Array of structures: 2.2715 sec
Structure of arrays: 0.0010529 sec

Test script:

Nobs = 1e5; % number of observations-rows
varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
Nvar = numel(varNames); % number of variables-columns

M = randn(Nobs, Nvar); % matrix

T = array2table(M, 'VariableNames', varNames); % table

NS = struct; % nonscalar structure = array of structures
for i=1:Nobs
    for v=1:Nvar
        NS(i).(varNames{v}) = M(i,v);
    end
end

SS = struct; % scalar structure = structure of arrays
for v=1:Nvar
    SS.(varNames{v}) = M(:,v);
end

%% TEST 1 - reading individual observations (row-wise)
disp('----'); disp('TEST1 - reading individual observations');

tic; % matrix
for i=1:Nobs
    x = M(i,:);
end
disp(['Matrix: ', num2str(toc()), ' sec']);

tic; % table
for i=1:Nobs
    x = T(i,:);
end
disp(['Table: ', num2str(toc()), ' sec']);

tic; % nonscalar structure = array of structures
for i=1:Nobs
    x = NS(i);
end
disp(['Array of structures: ', num2str(toc()), ' sec']);

tic; % scalar structure = structure of arrays
for i=1:Nobs
    for v=1:Nvar
        x.(varNames{v}) = SS.(varNames{v})(i);
    end
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);

%% TEST 2 - reading individual variables (column-wise)
disp('----'); disp('TEST2 - reading individual variables');

tic; % matrix
for v=1:Nvar
    x = M(:,v);
end
disp(['Matrix: ', num2str(toc()), ' sec']);

tic; % table
for v=1:Nvar
    x = T.(varNames{v});
end
disp(['Table: ', num2str(toc()), ' sec']);

tic; % nonscalar structure = array of structures
for v=1:Nvar
    for i=1:Nobs
        x(i,1) = NS(i).(varNames{v});
    end
end
disp(['Array of structures: ', num2str(toc()), ' sec']);

tic; % scalar structure = structure of arrays
for v=1:Nvar
    x = SS.(varNames{v});
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);
asked Jun 21 '17 14:06 by Sairus


1 Answer

I would use matrices, since they're the fastest and most straightforward to use, and then create a set of enumerated column labels to make indexing columns easier. Here are a few ways to do this:


Use a containers.Map object:

Given your variable names, and assuming they map in order from columns 1 through N, you can create a mapping like so:

varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
col = containers.Map(varNames, 1:numel(varNames));

And now you can use the map to access columns of your data by variable name. For example, if you want to fetch the columns for variables A and C (i.e. the first and third) from a matrix named data, you would do this:

subData = data(:, [col('A') col('C')]);
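
When several columns are needed at once, the map can also be queried with a cell array of keys instead of one lookup per name. The following is only a minimal sketch: the small data matrix and the variable wanted are made up for illustration and are not part of the question.

```matlab
% Illustrative data: 4 observations of 3 variables (values are made up)
varNames = {'A','B','C'};
data = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
col = containers.Map(varNames, 1:numel(varNames));

% values(map, keys) returns a cell array with the mapped column indices
wanted = {'A','C'};
idx = cell2mat(values(col, wanted));

subData = data(:, idx);   % variables A and C for all observations
rowObs  = data(2, idx);   % the same variables for observation 2 only
```

This keeps the data in a plain matrix, so row access stays as cheap as in the matrix test above; only the name-to-index lookup goes through the map.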


Use a struct:

You can create a structure with the variable names as its fields and the corresponding column indices as their values like so:

enumData = [varNames; num2cell(1:numel(varNames))];
col = struct(enumData{:});

And here's what col contains:

struct with fields:

  A: 1
  B: 2
  C: 3
  D: 4
  E: 5
  F: 6
  G: 7
  H: 8
  I: 9
  J: 10
  K: 11
  L: 12
  M: 13
  N: 14
  O: 15

And you would access columns A and C like so:

subData = data(:, [col.A col.C]);
% ...or with dynamic field names...
subData = data(:, [col.('A') col.('C')]);
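
To show how this combines row-number and column-label indexing on one matrix, here is a small sketch. The data values and the names list are invented for illustration; only the construction of col follows the code above.

```matlab
% Illustrative data: 4 observations of 3 variables (values are made up)
varNames = {'A','B','C'};
data = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
enumData = [varNames; num2cell(1:numel(varNames))];
col = struct(enumData{:});

% Row-number and column-label indexing on the same matrix
obs3 = data(3, :);        % all variables of observation 3
valB = data(3, col.B);    % variable B of observation 3
colC = data(:, col.C);    % variable C for all observations

% Resolve a list of names chosen at run time via dynamic field references
names = {'C','A'};
idx = cellfun(@(n) col.(n), names);
reordered = data(:, idx); % columns C and A, in that order
```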


Make a bunch of variables:

You could just create a variable in your workspace for every column name and store the column indices in them. This will pollute your workspace with more variables, but gives you a terse way to access column data. Here's an easy way to do it, using the much-maligned eval:

enumData = [varNames; num2cell(1:numel(varNames))];
eval(sprintf('%s=%d;', enumData{:}));

And accessing columns A and C is as easy as:

subData = data(:, [A C]);


Use an enumeration class:

This is probably a good dose of overkill, but if you're going to use the same mapping of column labels and indices for many analyses you could create an enumeration class, save it somewhere on your MATLAB path, and never have to worry about defining your column enumerations again. For example, here's a ColVar class with 15 enumerated values:

classdef ColVar < double
  enumeration
    A (1)
    B (2)
    C (3)
    D (4)
    E (5)
    F (6)
    G (7)
    H (8)
    I (9)
    J (10)
    K (11)
    L (12)
    M (13)
    N (14)
    O (15)
  end
end

And you would access columns A and C like so:

subData = data(:, [ColVar.A ColVar.C]);
answered Oct 29 '22 17:10 by gnovice