Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matlab data structure for mixed type - what's time + space efficient?

I need to process large amounts of tabular data of mixed type - strings and doubles. A standard problem, I would think. What is the best data structure in Matlab for working with this?

Cellarray is definitely not the answer. It is extremely memory inefficient. (tests shown below). Dataset (from stats toolbox) is horribly time and space inefficient. That leaves me with structarray or struct of arrays. I did a test across all four different options for both time and memory below and it seems to me the struct of arrays is the best option for the things I tested for.

I am relatively new to Matlab and this is a bit disappointing, frankly. Anyway - looking for advice on whether I am missing something, or if my tests are accurate/reasonable. Am I missing other considerations besides access/conversion/memory usage that are likely to come up as I code more using this stuff. (fyi am using R2010b)

** Test #1: Access speed Accessing a data item.

cellarray:0.002s
dataset:36.665s       %<<< This is horrible
structarray:0.001s
struct of array:0.000s

** Test #2: Conversion speed and memory usage I dropped dataset from this test.

Cellarray(doubles)->matrix:d->m: 0.865s
Cellarray(mixed)->structarray:c->sc: 0.268s
Cellarray(doubles)->structarray:d->sd: 0.430s
Cellarray(mixed)->struct of arrays:c->sac: 0.361s
Cellarray(doubles)->struct of arrays:d->sad: 0.887s
  Name           Size               Bytes  Class     Attributes
    c         100000x10            68000000  cell                
    d         100000x10            68000000  cell                
    m         100000x10             8000000  double              
    sac            1x1             38001240  struct              
    sad            1x1              8001240  struct              
    sc        100000x1             68000640  struct              
    sd        100000x1             68000640  struct  

================== CODE: TEST#1

  %% cellarray
  c = cell(100000,10);
  c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
  c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
  cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
  te = tic;
  for iii=1:1000
      x = c(1234,5);
  end
  te = toc(te);
  fprintf('cellarray:%0.3fs\n', te);
  %% dataset
  ds = dataset( { c, cols{:} } );
  te = tic;
  for iii=1:1000
      x = ds(1234,5);
  end
  te = toc(te);
  fprintf('dataset:%0.3fs\n', te);
  %% structarray
  s = cell2struct( c, cols, 2 );
  te = tic;
  for iii=1:1000
      x = s(1234).Var5;
  end
  te = toc(te);
  fprintf('structarray:%0.3fs\n', te);
  %% struct of arrays
  for iii=1:numel(cols)
      if iii/2==floor(iii/2) % even => string
          sac.(cols{iii}) = c(:,iii);
      else
          sac.(cols{iii}) = cell2mat(c(:,iii));
      end
  end
  te = tic;
  for iii=1:1000
      x = sac.Var5(1234);
  end
  te = toc(te);
  fprintf('struct of array:%0.3fs\n', te);

================== CODE: TEST #2

%% cellarray
% c - cellarray containing mixed type 
c = cell(100000,10);
c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
% c - cellarray containing doubles only
d = num2cell( zeros( 100000, 10 ) );
%% matrix
% doubles only
te = tic;
m = cell2mat(d);
te = toc(te);
fprintf('Cellarray(doubles)->matrix:d->m: %0.3fs\n', te);
%% structarray
% mixed
te = tic;
sc = cell2struct( c, cols, 2 );
te = toc(te);
fprintf('Cellarray(mixed)->structarray:c->sc: %0.3fs\n', te);
% doubles
te = tic;
sd = cell2struct( d, cols, 2 );
te = toc(te);
fprintf('Cellarray(doubles)->structarray:d->sd: %0.3fs\n', te);
%% struct of arrays
% mixed
te = tic;
for iii=1:numel(cols)
    if iii/2==floor(iii/2) % even => string
        sac.(cols{iii}) = c(:,iii);
    else
        sac.(cols{iii}) = cell2mat(c(:,iii));
    end
end
te = toc(te);
fprintf('Cellarray(mixed)->struct of arrays:c->sac: %0.3fs\n', te);
% doubles
te = tic;
for iii=1:numel(cols)
    sad.(cols{iii}) = cell2mat(d(:,iii));
end
te = toc(te);
fprintf('Cellarray(doubles)->struct of arrays:d->sad: %0.3fs\n', te);
%% 
clear iii cols te;
whos
like image 367
Matal Avatar asked Apr 26 '13 19:04

Matal


2 Answers

The way to make Matlab code space- and time- efficient is to work with large arrays of primitives - that is, arrays of doubles, ints, or chars. This gives you a tighter layout in memory and lets you do vectorized operations.

For tabular data, each column is going to be homogenous in type, but different columns may be of different types, and you'll typically have a lot more rows than columns. And you'll often be doing operations - comparisons or math - over all elements of a column, or a masked selection of a column, which lends itself to vectorized operations. So store each column as a column array, hopefully of primitives. You can stick those columns either in fields of a struct or elements of a cell vector; doesn't matter much performance-wise, and the struct form will be much more readable and looks more like a table. A 2-d cell array or other data structure that breaks all the elements up in to their own small mxarrays is not going to perform accepatbly.

That is, if you have a 10,000 row by 10 column table, you want to have a 10-long cell array or 10-field struct, with each of those fields or elements holding a 10,000-long primitive column vector.

The dataset object object is basically a wrapper around a struct of column arrays like described before, stuck in an object. But objects in Matlab have greater overhead than regular structs and cells; you're paying for one or more method calls on every access to it. Have a look at Is MATLAB OOP slow or am I doing something wrong? (full disclosure: that's one of my answers).

The test you have set up isn't indicative of how good Matlab code will perform, because it's doing scalar single-element access. That is, it's paying for column and then row element access on every pass through the loop. If your Matlab code is doing that, you're already out of luck. To be fast, you need to pop out columns outside of a loop - that is, hoist the expensive column access operation to an outer loop or setup code - and then either do vectorized operations (like +, ==, '<', ismember, and so on) on entire column vectors, or loop over primitive numeric vectors (which the JIT can optimize). If you do this, then dataset and other object-based tabular structures can have decent performance.

Strings in Matlab kind of suck, unfortunately. You want to get away from cellstrs. You have a couple options.

  • If the strings in a column are of about the same length, and you don't have any long strings in them, you can store a vector of strings as a 2-D char array. This is a single contiguous array in memory and is more space-efficient than a cell array, and may be faster for comparison operations and so on. It's also one of Matlab's native string representations, so normal string functions will work with it.
  • If the strings are low cardinality (that is, the number of distinct values is small relative to the total number of elements), you can encode them as "symbols", storing them as an array of primitive ints which are indexes in to a list of the distinct string values. The unique and ismember function can help implement these encodings. As long as you're just doing equality tests and not sorting, these encoded string columns will operate at numeric speed.
  • I believe one of the Toolboxes, maybe the one with dataset, has support for "classifier" or "categorical" variables, which are basically a ready-made implementation of that low-cardinality encoding.
  • Don't bother with Java Strings; the overhead in crossing the Matlab-to-Java barrier will make it a net loss.
  • Hopefully somebody has come up with something else by now.

If you have to stick with cellstrs, store them as cellstr column vectors inside the structure as described above; that way you only have to pay the price of cell access when you're actually operating on a string column.

like image 113
Andrew Janke Avatar answered Oct 12 '22 19:10

Andrew Janke


I would say that if you need to manage large amount of data, then MATLAB is not the best choice to begin with. Go for a proper db and eventually import the data you need into MATLAB.

However, if you're planning to use MATLAB anyways I would still choose cellarrays, that is if you do not need syntactic references to your data in the form of fieldnames as in structures.

When using cellarrays, keep in mind that each cell imposes 112 bytes of overhead. Therefore, I would create a cell for each column (not a cell for each scalar double):

c = cell(1,10);
c(1,1:2:10) = num2cell(rand(1e5,5),1);
c(1,2:2:10) = {cellstr(repmat('asdf', 100000, 1))};

and memory-wise (no changes in time):

Name           Size               Bytes  Class    Attributes
c              1x10            38000600  cell   

Also, what you call a struct of array is usually referred to with 'scalar' structure as opposed to struct array (or non-scalar structure).

If I recall correctly, structs tend to degrade in performance for read/writes when you start nesting fields (I need to find the specific thread though).

like image 32
Oleg Avatar answered Oct 12 '22 18:10

Oleg