Let A and B be two table (or dataset) objects having a few factor variables K1, K2, ..., Kn in common. I'll call these n factors key factors. Also, I'll say that a row of A and a row of B match iff they match at these key factors, i.e. iff A.K1 == B.K1 & A.K2 == B.K2 & ... & A.Kn == B.Kn.
(I give an example of the above, with code, at the end of this post.)
I'm looking for an efficient way to extract all the rows of
Athat "match" one1 row ofB.
(More precisely, I want to generate a new table C consisting of all the rows of A that "match" one row of B.)
Is there an efficient way to do this? (I'm trying to avoid for-Loop Hell here, or any solution that depends on knowing, say, the number of key factors, etc., since I want to use this in code that does not know much about A and B until runtime.)
I tried intersect(A, B), but it fails with the error:
>> keyfactors = intersect(A.Properties.VarNames, B.Properties.VarNames);
>> intersect(A, B, keyfactors)
Error using table/intersect (line 41)
A and B must contain the same variables.
The only other strategy I can come up with (though I don't yet know how to implement it) is to somehow synthesize a function on-the-fly based on the information encoded in B. For example, given an indexing table B like this
B = cell2table({
'even', 'red', 'diamonds';
'odd', 'yellow', 'spades';
'odd', 'green', 'hearts'
}, 'VariableNames', ...
{'Parity', 'TrafficLight', 'Suit'});
...somehow programmatically generate a function like
fn = @(c) any([all([strcmp(c(:, 1), 'even'), ...
strcmp(c(:, 2), 'red'), ...
strcmp(c(:, 3), 'diamonds')], 2), ...
all([strcmp(c(:, 1), 'odd'), ...
strcmp(c(:, 2), 'yellow'), ...
strcmp(c(:, 3), 'spades')], 2), ...
all([strcmp(c(:, 1), 'odd'), ...
strcmp(c(:, 2), 'green'), ...
strcmp(c(:, 3), 'hearts')], 2)], 2);
...that could then be used in an expression like this:
Ac = table2cell(A);
C = cell2table(Ac(fn(Ac), :), ...
'VariableNames', B.Properties.VariableNames);
Unfortunately, I can't figure out how to programmatically synthesize a function like fn above without resorting to the (strongly deprecated) use of eval.
Here are examples of the kinds of A and B tables I'm thinking of. (For those who don't have a version of MATLAB that supports the (rather new) table objects, the code below should work if the strings "table" and "VariableNames" are replaced with the strings "dataset" and "VarNames", respectively.)
% "data" table
A = cell2table({
'even', 'red', 'spades', 38, 0.9837;
'even', 'red', 'hearts', 19, 0.5695;
'even', 'red', 'diamonds', 89, 0.2629;
'even', 'red', 'diamonds', 98, 0.3578;
'even', 'red', 'diamonds', 92, 0.2596;
'even', 'red', 'diamonds', 69, 0.5751;
'even', 'red', 'diamonds', 77, 0.6318;
'even', 'yellow', 'clubs', 22, 0.6917;
'even', 'green', 'spades', 35, 0.6674;
'even', 'green', 'hearts', 67, 0.7896;
'even', 'green', 'hearts', 49, 0.5025;
'even', 'green', 'hearts', 64, 0.5318;
'odd', 'red', 'spades', 22, 0.5587;
'odd', 'red', 'hearts', 51, 0.9122;
'odd', 'red', 'diamonds', 74, 0.3343;
'odd', 'red', 'diamonds', 69, 0.2911;
'odd', 'yellow', 'spades', 33, 0.2653;
'odd', 'yellow', 'spades', 38, 0.2549;
'odd', 'yellow', 'diamonds', 1, 0.2064;
'odd', 'yellow', 'diamonds', 25, 0.8257;
'odd', 'green', 'spades', 64, 0.4348;
'odd', 'green', 'hearts', 59, 0.8644;
'odd', 'green', 'hearts', 4, 0.6374;
'odd', 'green', 'hearts', 11, 0.3354
}, 'VariableNames', ...
{'Parity', 'TrafficLight', 'Suit', 'order', 'prevalence'});
% "indexing" table
B = cell2table({
'i', 'even', 'red', 'diamonds';
'ii', 'odd', 'yellow', 'spades';
'iii', 'odd', 'green', 'hearts'
}, 'VariableNames', ...
[{'class'} A.Properties.VariableNames(1:3)]);
keyfactors = intersect(B.Properties.VariableNames, ...
A.Properties.VariableNames, ...
'stable');
% this fails:
% intersect(A, B, keyfactors);
% desired subtable would look like this one
C = cell2table({
'even', 'red', 'diamonds', 89, 0.2629;
'even', 'red', 'diamonds', 98, 0.3578;
'even', 'red', 'diamonds', 92, 0.2596;
'even', 'red', 'diamonds', 69, 0.5751;
'even', 'red', 'diamonds', 77, 0.6318;
'odd', 'yellow', 'spades', 33, 0.2653;
'odd', 'yellow', 'spades', 38, 0.2549;
'odd', 'green', 'hearts', 59, 0.8644;
'odd', 'green', 'hearts', 4, 0.6374;
'odd', 'green', 'hearts', 11, 0.3354
}, 'VariableNames', ...
{'Parity', 'TrafficLight', 'Suit', 'order', 'prevalence'});
1 One may safely assume that, in B, any combination of values (aka "levels") for the key factors would appear at most once. This would usually not be true for A. In practice, A would contain real data, while B would be "synthetic" (i.e. constructed specifically for the operation described here), and would have fewer variables and much fewer rows than A does. One may therefore think of B as an "indexing table" for A.
This returns what you want, but I don't know if it works generally so you'll need to test:
[~,iB,iA] = intersect(B.Properties.VariableNames, ...
A.Properties.VariableNames, ...
'stable');
C = A(ismember(A(:,iA),intersect(A(:,iA),B(:,iB),'stable')),:)
Part of the problem is that table/intersect only returns the first occurrence when rows a re repeated. I use table/ismember to produce logical indices to get around this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With