Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Octave / Matlab : Determine unique rows in very large matrix

Tags:

matlab

octave

I have a 13146 x 13146 matrix in Octave whose unique rows I want to determine. unique(M, "rows") fails because of Octave internals and/or memory limitations.

I looked at other posts on finding unique rows, but none of them addressed this issue with large matrices.

My approach now would be to "divide and conquer", e. g. by

A(:,:,i)=M(r_i:s_i,:)
B(:,:,i)=unique(A(:,:,i), "rows")

with i the index of the submatrix, r_i and s_i the starting and ending row numbers of the submatrices. To return all the data into one large matrix (and again determine unique rows):

C(:,:,i)=B(:,:,i)'
D=reshape(C,l,n*m)
E=D'
F=unique(E, "rows")

with n the number of submatrices, m the original number of rows in a submatrix and l the number of columns.

Is there a better way to achieve the desired result?

like image 843
Raketenolli Avatar asked Apr 23 '15 16:04

Raketenolli


People also ask

How do I find unique rows in Matlab?

C = unique( A ) returns the same data as in A , but with no repetitions. C is in sorted order. If A is a table or timetable, then unique returns the unique rows in A in sorted order.

How do you select a row in a matrix?

You should therefore use a comma to separate the rows you want to select from the columns. For example: my_matrix[1,2] selects the element at the first row and second column. my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.


2 Answers

Radix sort it!

Sounds like you need a memory-efficient sorting algorithm. Unique rows can be found by first sorting the rows, and then checking adjacent rows for duplicates. You could adapt a radix sort for this, sorting on each column in sequence (instead of each digit in sequence). That would be a peak memory cost of sorting one column instead of the entire matrix. Then step through the rows in the sorted result and eliminate duplicates. That's an O(n) operation and only needs enough memory to hold two rows.

It can be "stable", too. If you track the rearranged row indexes in addition to the rearranged row values during the process of sorting, you can compute the input-output mapping indexes. (These are the I in the signature of Matlab's own [B,I] = sort(A).) That in turn will allow you to rearrange the post-duplicate-removal rows back in to their original order in the input, so you can preserve their order. (Like the setOrder='stable' option of Matlab's unique().) They can also be used to calculate the in-out mapping indexes for the overall uniqueness operation, so you can reproduce the full multi-output signature of unique(), which can be quite useful.

Example code

Here's a basic example implementation. I haven't tested it thoroughly, so don't use it in production without doing your own testing.

function A = rrunique(A)
%RRUNIQUE "Radix Row Unique" - find unique rows using radix sort
%
% # Returns the distinct rows in A. Uses the memory-efficient radix sort
% # algorithm, so peak memory usage stays low(ish) for large matrices.

% # This uses a modified radix sort where only the row remapping indexes are
% # rearranged at each step, instead of sorting the whole input, to avoid
% # having to rewrite the large input matrix.

ix = 1:size(A,1); % # Final in-out mapping indexes

% # Radix sort the rows
for iCol = size(A,2):-1:1
    c = A(ix,iCol);
    [~,ixStep] = sort(c);
    % # Don't do this! too slow
    % # A = A(ixStep,:);
    % # Just reorder the mapping indexes
    ix = ix(ixStep);
end    

% # Now, reorder the big array all at once
A = A(ix,:);

% # Remove duplicates
tfKeep = true(size(A,1),1);
priorRow = A(1,:);
for iRow = 2:size(A,1)
    thisRow = A(iRow,:);
    if isequal(thisRow, priorRow)
        tfKeep(iRow) = false;
    else
        priorRow = thisRow;
    end
end
A = A(tfKeep,:);

end

When I tested this on a matrix of your size on Matlab R2014b on OS X, it peaked at around 3 GB of memory used, vs about 1 GB to hold just the input matrix. Not bad.

>> m = rand([13146,13146]);
>> tic; rrunique(m); toc
Elapsed time is 17.435783 seconds.
like image 176
Andrew Janke Avatar answered Sep 20 '22 21:09

Andrew Janke


You can use a sliding window on columns and use a window size that doesn't cause a memory problem. Here is a solution

function A = winuninque(A, winSz)
nCol = size(A,2);
I = zeros(size(A,1), 0);
for k=1:winSz:nCol
    [~, ~, I] = unique([I A(:,k:min(k+winSz-1,end))], 'rows');
end
[~, I] = unique(I);
A = A(I, :);
end

Benchmark

In order to actually have some duplicate rows it's better to generate the matrix with some duplicates, otherwise it will be just sorting. Here is a comparison between different approaches:

>> A=repmat(rand(13146,13146), 2, 1);
>> A=A(randperm(end), :);
>> A=A(1:end/2,:);

>> tic; B=rrunique(A); toc
Elapsed time is 13.318752 seconds.
>> tic; C=winunique(A, 16); toc
Elapsed time is 6.606122 seconds.
>> tic; D=unique(A, 'rows'); toc
Elapsed time is 29.815333 seconds.
>> isequal(B,C,D)
ans =
     1
>> size(D)
ans =
        9880       13146
like image 22
Mohsen Nosratinia Avatar answered Sep 17 '22 21:09

Mohsen Nosratinia