
Matlab array having columns of different data types?

I want to make a SQL-style table in Matlab, meaning each row is an observation and each column is a field. It's all numeric, so I should be able to hold it in a 2D array, but for space-efficiency I need some of the fields to take up fewer bits than others.

Is there a way to have an array A where A(:,1) is all of type uint32 and A(:,2) is all of type uint8, for instance?

I'm currently accomplishing this with a cell array of arrays, where each cell in the cell array represents a column (as an n-by-1 array), and then I access the individual record's value array-style.

Example: to get field 2 of record 45 I use A{2}(45).

Problem: this isn't very speed-efficient, since I can't vectorize it to get all fields of a particular record (side question: Is there a way to vectorize like A{1:3}(45) ?).

TheBigAmbiguous asked Oct 20 '22 12:10


1 Answer

In short, no, it's not possible to do this in Matlab. It is at odds with how Matlab's basic data types work. But that's okay. The performant way to do table-style work in Matlab is with something like what you've already got (a cell or other composite type containing columns as homogeneous arrays), with your code changed to use vectorized, column-oriented functions on them.

It sounds like you're asking about how to structure a table-like object in Matlab such that the fields of a record with heterogeneous types are contiguous in memory, like a C struct or a traditional record-oriented RDBMS table's physical layout. Matlab's data types don't work like that. All of Matlab's primitive arrays are of homogeneous type, laid out contiguously in memory; all the heterogeneous types are built out of cells, structs, objects, or other composite types that reference the primitive arrays they "contain".

So there are a lot of ways you could build tables with different column types, using cells like you're doing, or table, or roll-your-own relation-style classes. But they all boil down to composite types storing the different primitive types in separate primitive arrays, so they'd all have the same access time characteristics as your cell-based implementation. Your current "cell array of columns" structure is fine, and typical of how you'd represent that data in Matlab. The other implementations will give you different syntax and more powerful functions to work with—and that's a good reason to use them—but their underlying data structures will look very much like what you've already got. (For what it's worth, the table data type that @Marcin mentioned sounds great: convenient syntax and a nice set of functions. But it's basically a wrapper on top of your cell-based solution, with the same performance characteristics.)
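As a rough illustration of the "cell array of columns" layout and its table equivalent, here is a minimal sketch. The data and the names id, flag, and value are made up for the example; the cellfun call is one possible answer to the side question about A{1:3}(45), not the only one.

```matlab
% Hypothetical example data: three columns with different numeric types.
n = 100;
A = { uint32((1:n)'), ...          % column 1: record id
      uint8(mod((1:n)', 2)), ...   % column 2: small flag
      (1:n)' / n };                % column 3: double value

% Field 2 of record 45, as in the question.
f2 = A{2}(45);

% All fields of record 45 in one call. The result is a cell array because
% the columns have different classes.
rec = cellfun(@(col) col(45), A(1:3), 'UniformOutput', false);

% The same columns wrapped in a table (available from R2013b).
T = table(A{:}, 'VariableNames', {'id', 'flag', 'value'});
row = T(45, :);          % a one-row table holding record 45
idClass = class(T.id);   % each table variable keeps its own class
```

Either way, each column stays a separate homogeneous array, so pulling out a whole record always means touching every column.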

Matlab's not built for iterating over individual "records" with heterogeneous fields and working with them one or a few at a time, as is typical in many other languages. To get fast in Matlab, you have to reorganize your algorithms to operate across the elements of a column or other primitive array. That is fundamentally what "vectorization" is. You can do it; all sorts of relational-style operations can be done efficiently in idiomatic Matlab code, using stuff like ismember, unique, index mapping, accumarray, and so on. You just need to change your approach.
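To make "operate across the elements of a column" concrete, here is a small sketch of two relational-style operations, a group-by sum and a filter, written with unique, accumarray, and ismember. The columns key and val and their contents are invented for illustration.

```matlab
% Hypothetical columns: a uint8 key column and a double value column.
key = uint8([1; 2; 1; 3; 2; 1]);
val = [10; 20; 30; 40; 50; 60];

% GROUP BY key, SUM(val): map each row to a group index, then accumulate.
[groups, ~, idx] = unique(key);   % groups = distinct keys, idx = group of each row
totals = accumarray(idx, val);    % one sum per group, in the order of groups

% WHERE key IN (1, 3): build a logical mask over the whole column at once.
wanted = uint8([1; 3]);
mask = ismember(key, wanted);
selected = val(mask);
```

Neither operation loops over records; each works on whole columns and uses index vectors to tie the columns together.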

Alternatives

The other way to do "tables" in Matlab is to do arrays of structs or cells, where each field of the struct or cell holds a scalar value. (An M row by N col table is an M-long array T of structs, each with N fields; T(i) gets the i-th row.) This will give you faster access to an individual "record" because it's already constructed. But it will be lousy in both speed and memory because then every element of every record is stored in its own 1-by-1 primitive array. (E.g. an M rows by N cols table ends up using O(M*N) primitive arrays instead of O(N).) And you can't use any vectorized operations on that arrangement.
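The trade-off can be seen in a short sketch that builds the same hypothetical data both ways (the field names id and flag are assumptions for the example). Exact byte counts depend on the Matlab version, so none are quoted here.

```matlab
n = 1000;
ids = uint32((1:n)');
flags = uint8(mod((1:n)', 2));

% Row-oriented: an n-by-1 struct array, one scalar per field per record.
rows = struct('id', num2cell(ids), 'flag', num2cell(flags));
r = rows(45);            % a single record, already assembled
allIds = [rows.id]';     % a column must be gathered before any vectorized work

% Column-oriented: one scalar struct whose fields are whole columns.
cols = struct('id', ids, 'flag', flags);
c45 = cols.id(45);

% Compare memory use; the row-oriented form pays overhead per element.
rowsInfo = whos('rows');
colsInfo = whos('cols');
fprintf('rows: %d bytes, cols: %d bytes\n', rowsInfo.bytes, colsInfo.bytes);
```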

A couple other thoughts

If you have any string columns, you'll probably need to build a custom string type or two. Matlab's basic string types, char and cellstr, are slow and memory-intensive, and don't support some of the polymorphic operations you might want to do on columns.

Be careful with those int types. Matlab's promotion rules for mixed-type arithmetic are odd, for historical reasons. When doubles and ints are mixed, the result is narrowed to the int type, so ints can end up "contaminating" data in functions they're passed in to. This makes ints less useful in practice than you might expect; you need to have guard code around them.
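A few lines show the kind of surprise meant here. This is only a sketch of the built-in behavior for uint8; the last line is one possible guard, not a recommendation from the original answer.

```matlab
a = uint8(200);

b = a + 1.5;               % result is uint8, not double; the value is rounded
c = a + 100;               % saturates at intmax('uint8') instead of wrapping
d = uint8(5) / uint8(2);   % integer division rounds rather than truncates
e = [a, 300.7];            % concatenation also converts the double to uint8

% One possible guard: widen explicitly before doing arithmetic.
safe = double(a) + 1.5;    % stays a double
```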

(And strictly speaking, you could do contiguous-record style stuff by dropping down to MEX or Java and writing all your code there, but then you're just writing C or Java instead of Matlab, in which case why use Matlab?)

Andrew Janke answered Nov 02 '22 10:11