Split vector in MATLAB

I'm trying to elegantly split a vector. For example,

vec = [1 2 3 4 5 6 7 8 9 10]

I want to split it according to another vector of 0's and 1's of the same length, where the 1's mark the positions at which the vector should be split (or rather cut, since those elements are dropped):

cut = [0 0 0 1 0 0 0 0 1 0]

This should give a cell output similar to the following:

[1 2 3] [5 6 7 8] [10]
Andrej Žukov-Gregorič asked Apr 25 '15 01:04


3 Answers

Solution code

You can use cumsum & accumarray for an efficient solution -

%// Create ID/labels for use with accumarray later on
id = cumsum(cut)+1   

%// Mask selecting the positions to keep, i.e. the zeros in cut
mask = cut==0        

%// Finally get the output with accumarray using masked IDs and vec values 
out = accumarray(id(mask).',vec(mask).',[],@(x) {x})
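To make the labelling step concrete, here is a small sketch that runs the code above on the example from the question and shows the intermediate values. The variable out_rows is not part of the original answer; it is only added here to convert the column vectors that accumarray returns into row vectors, which is closer to the output format asked for.

```matlab
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];

% Every 1 in cut bumps the running sum, so each segment gets its own label
id = cumsum(cut) + 1;      % [1 1 1 2 2 2 2 2 3 3]

% Keep only the positions that are not cut points
mask = cut == 0;           % drops positions 4 and 9

% Group the kept values by label; each cell holds a column vector
out = accumarray(id(mask).', vec(mask).', [], @(x) {x});

% Optional (assumption: row vectors are wanted, as in the question)
out_rows = cellfun(@(c) c.', out, 'UniformOutput', false);
celldisp(out_rows)
```

Because cumsum never decreases, the labels passed to accumarray are already sorted, so the elements inside each cell stay in their original order.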

Benchmarking

Here are some performance numbers for the three approaches posted in the answers here, using a large random input -

N = 100000;  %// Input Datasize

vec = randi(100,1,N); %// Random inputs
cut = randi(2,1,N)-1;

disp('-------------------- With CUMSUM + ACCUMARRAY')
tic
id = cumsum(cut)+1;
mask = cut==0;
out = accumarray(id(mask).',vec(mask).',[],@(x) {x});
toc

disp('-------------------- With FIND + ARRAYFUN')
tic
N = numel(vec);
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
toc

disp('-------------------- With CUMSUM + ARRAYFUN')
tic
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN;  %Don't include the cut indices themselves
sumvals = unique(cutsum);      % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = [];  %Remove NaN values from sumvals
output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
toc

Runtimes

-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.068102 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.117953 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 12.560973 seconds.
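A single tic/toc measurement can be noisy, so if you want to repeat the comparison yourself, one option is MATLAB's built-in timeit, which calls a function handle several times and returns a typical run time. The following is only a sketch under that assumption: the function names bench_split, split_accum and split_find are made up for illustration, and the numbers it prints will differ from the ones listed above depending on machine and MATLAB version.

```matlab
function [t_accum, t_find] = bench_split()
% Hypothetical harness (names are illustrative, not from the answer).
    N = 100000;                   % input size, as in the benchmark above
    vec = randi(100, 1, N);       % random values
    cut = randi(2, 1, N) - 1;     % random 0/1 cut vector

    % timeit runs each handle repeatedly and reports a robust estimate
    t_accum = timeit(@() split_accum(vec, cut));
    t_find  = timeit(@() split_find(vec, cut));

    fprintf('CUMSUM + ACCUMARRAY: %.6f s\n', t_accum);
    fprintf('FIND + ARRAYFUN:     %.6f s\n', t_find);
end

function out = split_accum(vec, cut)
    % cumsum + accumarray approach, copied from the benchmark above
    id = cumsum(cut) + 1;
    mask = cut == 0;
    out = accumarray(id(mask).', vec(mask).', [], @(x) {x});
end

function out = split_find(vec, cut)
    % find + arrayfun approach, copied from the benchmark above
    N = numel(vec);
    ind = find(cut);
    ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
    ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
    out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'UniformOutput', false);
end
```

The slow cumsum + arrayfun variant is left out of this sketch on purpose, since timeit would call it many times.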

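Special case scenario: If cut can contain runs of 1's (or a 1 in the first position), the IDs from cumsum skip some values and accumarray returns empty cells for the skipped IDs. To avoid that, you need to modify a few things as listed next -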

%// Mask selecting the positions to keep, i.e. the zeros in cut
mask = cut==0  

%// Setup IDs differently this time. The idea is to have successive IDs.
id = cumsum(cut)+1
[~,~,id] = unique(id(mask))
      
%// Finally get the output with accumarray using masked IDs and vec values 
out = accumarray(id(:),vec(mask).',[],@(x) {x})

Sample run with such a case -

>> vec
vec =
     1     2     3     4     5     6     7     8     9    10
>> cut
cut =
     1     0     0     1     1     0     0     0     1     0
>> celldisp(out)
out{1} =
     2
     3
out{2} =
     6
     7
     8
out{3} =
    10
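To see why the renumbering with unique matters, this sketch compares the raw labels with the renumbered ones on the sample above. The names raw_id, kept_id, new_id and out_naive are introduced here only for illustration.

```matlab
vec = 1:10;
cut = [1 0 0 1 1 0 0 0 1 0];
mask = cut == 0;

raw_id  = cumsum(cut) + 1;    % [2 2 2 3 4 4 4 4 5 5]
kept_id = raw_id(mask);       % [2 2 4 4 4 5], labels 1 and 3 never occur

% Without renumbering, accumarray leaves empty cells for labels 1 and 3
out_naive = accumarray(kept_id.', vec(mask).', [], @(x) {x});

% The third output of unique maps the labels to 1, 2, 3 with no gaps
[~, ~, new_id] = unique(kept_id);
out = accumarray(new_id(:), vec(mask).', [], @(x) {x});

fprintf('cells without renumbering: %d, with renumbering: %d\n', ...
        numel(out_naive), numel(out));
```

An alternative would be to keep the original IDs and delete the empty cells afterwards; the renumbering avoids creating them in the first place.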
Divakar answered Sep 27 '22 17:09


For this problem, a handy function is cumsum, which can create a cumulative sum of the cut array. The code that produces an output cell array is as follows:

vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];

cutsum = cumsum(cut);
cutsum(cut == 1) = NaN;  %Don't include the cut indices themselves
sumvals = unique(cutsum);      % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = [];  %Remove NaN values from sumvals
output = {};
for i=1:numel(sumvals)
    output{i} = vec(cutsum == sumvals(i)); %#ok<SAGROW>
end

As another answer shows, you can use arrayfun to create a cell array with the results. To apply that here, you'd replace the for loop (and the initialization of output) with the following line:

output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);

That's nice because it doesn't end up growing the output cell array.

The key feature of this routine is the variable cutsum, which ends up looking like this:

cutsum =
     0     0     0   NaN     1     1     1     1   NaN     2

Then all we need to do is use it to create indices to pull the data out of the original vec array. We loop over the unique non-NaN values of cutsum and, for each one, pull the elements of vec whose cutsum entry matches it.

Notice that this routine handles some situations that may arise. For instance, it handles 1 values at the very beginning and very end of the cut array, and it gracefully handles repeated ones in the cut array without creating empty arrays in the output. This is because unique builds the set of values to search for in cutsum, and because we throw out the NaN values in the sumvals array.

You could use -1 instead of NaN as the signal flag for the cut locations to not use, but I like NaN for readability. The -1 value would probably be more efficient: unique returns its values sorted, so whenever cut contains at least one 1 the flag is the first element of sumvals, and removing it is just a matter of dropping that element. It's just my preference to use NaN as a signal flag.
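For comparison, a version with -1 as the flag might look like the sketch below. It removes the flag by value rather than by position, which is an assumption made here so that the code also works when cut contains no 1 at all. One relevant detail: unique treats NaN values as distinct from each other, so the NaN version carries one NaN per cut point through sumvals, while -1 collapses to a single entry.

```matlab
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];

cutsum = cumsum(cut);
cutsum(cut == 1) = -1;            % flag the cut positions with -1 instead of NaN

sumvals = unique(cutsum);         % sorted: [-1 0 1 2]
sumvals(sumvals == -1) = [];      % drop the flag (safe even if cut has no 1's)

output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', false);
celldisp(output)
```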

The output of this is a cell array with the results:

output{1} =
     1     2     3
output{2} =
     5     6     7     8
output{3} =
    10

There are some odd conditions we need to handle. Consider the situation:

vec = [1 2 3 4 5 6 7 8 9 10 11 12 13 14];
cut = [1 0 0 1 1 0 0 0 0 1  0  0  0  1];

There are repeated 1's in there, as well as a 1 at the beginning and end. This routine properly handles all this without any empty sets:

output{1} = 
     2     3
output{2} =
     6     7     8     9
output{3} = 
    11    12    13
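If you want to try both inputs without copying the code twice, the routine can be wrapped in a function. The name split_by_cut is hypothetical, and the sketch assumes cut is a logical or double row vector (assigning NaN would not work as intended with integer types).

```matlab
function output = split_by_cut(vec, cut)
% SPLIT_BY_CUT  Hypothetical wrapper around the cumsum/unique routine above.
%   Example calls:
%     a = split_by_cut(1:10, [0 0 0 1 0 0 0 0 1 0]);
%     b = split_by_cut(1:14, [1 0 0 1 1 0 0 0 0 1 0 0 0 1]);
    cutsum = cumsum(cut);             % segment label for every position
    cutsum(cut == 1) = NaN;           % exclude the cut positions themselves
    sumvals = unique(cutsum);         % labels that actually occur
    sumvals(isnan(sumvals)) = [];     % remove the NaN flags
    output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', false);
end
```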
Tony answered Sep 27 '22 16:09


You can do this with a combination of find and arrayfun:

vec = [1 2 3 4 5 6 7 8 9 10];
N = numel(vec);
cut = [0 0 0 1 0 0 0 0 1 0];
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);

We thus get:

>> celldisp(out)

out{1} =

     1     2     3         

out{2} =

     5     6     7     8    

out{3} =

    10

So how does this work? Well, the first line defines your input vector, the second line finds how many elements are in this vector and the third line denotes your cut vector which defines where we need to cut in our vector. Next, we use find to determine the locations that are non-zero in cut which correspond to the split points in the vector. If you notice, the split points determine where we need to stop collecting elements and begin collecting elements.

However, we need to account for the beginning of the vector as well as the end. ind_after tells us the locations of where we need to start collecting values and ind_before tells us the locations of where we need to stop collecting values. To calculate these starting and ending positions, you simply take the result of find and add and subtract 1 respectively.

Each corresponding pair of positions in ind_after and ind_before tells us where we need to start and stop collecting values. To account for the beginning of the vector, ind_after needs to have the index 1 inserted at the beginning, because index 1 is where we should start collecting values for the first segment. Similarly, N needs to be inserted at the end of ind_before, because this is where we need to stop collecting values for the last segment.

Now for ind_after and ind_before, there is a degenerate case where the cut point may be at the end or beginning of the vector. If this is the case, then subtracting or adding 1 will generate a start or stop position that's out of bounds. We check for this in the 5th and 6th lines of code and simply clamp these to 1 or N depending on whether we're at the beginning or end of the array.

The last line of code uses arrayfun and iterates through each pair of ind_after and ind_before to slice into our vector. Each result is placed into a cell array, and our output follows.
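As a quick check of the explanation, this sketch prints the start and stop index of every segment for the example above. The loop and the fprintf call are only there for illustration; the actual slicing is done by arrayfun as in the original code.

```matlab
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];
N = numel(vec);

ind = find(cut);                                         % cut positions: [4 9]
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;  % stops:  [3 8 10]
ind_after  = [1 ind+1]; ind_after(ind_after > N) = N;    % starts: [1 5 10]

% Each start/stop pair describes one slice of vec
for k = 1:numel(ind_after)
    fprintf('segment %d: vec(%d:%d)\n', k, ind_after(k), ind_before(k));
end

out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'UniformOutput', false);
```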


We can check for the degenerate case by placing a 1 at the beginning and end of cut and some values in between:

vec = [1 2 3 4 5 6 7 8 9 10];
cut = [1 0 0 1 0 0 0 1 0 1];

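Using this example and the above code, we get the output below. Note that because the out-of-range indices are clamped to 1 and N, the cut elements at the first and last positions (1 and 10) come back as single-element cells rather than being dropped: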

>> celldisp(out)

out{1} =

     1

out{2} =

     2     3         

out{3} =

     5     6     7

out{4} =

     9         

out{5} =

    10
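If the elements at the cut positions should never appear in the output, even when a cut sits at the very beginning or end, one possible variation is to skip the clamping and discard the empty slices instead. This is a sketch of that idea, not the code from the answer above; the names starts and stops are introduced here for readability. It relies on the fact that a range such as 1:0 or 11:10 is empty in MATLAB, so indexing vec with it returns an empty array instead of raising an out-of-bounds error.

```matlab
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [1 0 0 1 0 0 0 1 0 1];
N = numel(vec);

ind = find(cut);          % [1 4 8 10]
starts = [1 ind+1];       % [1 2 5 9 11], no clamping
stops  = [ind-1 N];       % [0 3 7 9 10], no clamping

% Slices with start > stop come out empty
out = arrayfun(@(x,y) vec(x:y), starts, stops, 'UniformOutput', false);

% Remove the empty slices left by leading, trailing, or repeated cuts
out = out(~cellfun(@isempty, out));
celldisp(out)
```

With this input the result contains [2 3], [5 6 7] and [9], without the cut elements 1 and 10.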
rayryeng answered Sep 27 '22 16:09