I have a 13146 x 13146 matrix in Octave whose unique rows I want to determine. unique(M, "rows")
fails because of Octave internals and/or memory limitations.
I looked at other posts on finding unique rows, but none of them addressed this issue with large matrices.
My approach now would be to "divide and conquer", e. g. by
A(:,:,i)=M(r_i:s_i,:)
B(:,:,i)=unique(A(:,:,i), "rows")
with i
the index of the submatrix, r_i
and s_i
the starting and ending row numbers of the submatrices.
To return all the data into one large matrix (and again determine unique rows):
C(:,:,i)=B(:,:,i)'
D=reshape(C,l,n*m)
E=D'
F=unique(E, "rows")
with n
the number of submatrices, m
the original number of rows in a submatrix and l
the number of columns.
Is there a better way to achieve the desired result?
C = unique( A ) returns the same data as in A , but with no repetitions. C is in sorted order. If A is a table or timetable, then unique returns the unique rows in A in sorted order.
You should therefore use a comma to separate the rows you want to select from the columns. For example: my_matrix[1,2] selects the element at the first row and second column. my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.
Sounds like you need a memory-efficient sorting algorithm. Unique rows can be found by first sorting the rows, and then checking adjacent rows for duplicates. You could adapt a radix sort for this, sorting on each column in sequence (instead of each digit in sequence). That would be a peak memory cost of sorting one column instead of the entire matrix. Then step through the rows in the sorted result and eliminate duplicates. That's an O(n)
operation and only needs enough memory to hold two rows.
It can be "stable", too. If you track the rearranged row indexes in addition to the rearranged row values during the process of sorting, you can compute the input-output mapping indexes. (These are the I
in the signature of Matlab's own [B,I] = sort(A)
.) That in turn will allow you to rearrange the post-duplicate-removal rows back in to their original order in the input, so you can preserve their order. (Like the setOrder='stable'
option of Matlab's unique()
.) They can also be used to calculate the in-out mapping indexes for the overall uniqueness operation, so you can reproduce the full multi-output signature of unique()
, which can be quite useful.
Here's a basic example implementation. I haven't tested it thoroughly, so don't use it in production without doing your own testing.
function A = rrunique(A)
%RRUNIQUE "Radix Row Unique" - find unique rows using radix sort
%
% # Returns the distinct rows in A. Uses the memory-efficient radix sort
% # algorithm, so peak memory usage stays low(ish) for large matrices.
% # This uses a modified radix sort where only the row remapping indexes are
% # rearranged at each step, instead of sorting the whole input, to avoid
% # having to rewrite the large input matrix.
ix = 1:size(A,1); % # Final in-out mapping indexes
% # Radix sort the rows
for iCol = size(A,2):-1:1
c = A(ix,iCol);
[~,ixStep] = sort(c);
% # Don't do this! too slow
% # A = A(ixStep,:);
% # Just reorder the mapping indexes
ix = ix(ixStep);
end
% # Now, reorder the big array all at once
A = A(ix,:);
% # Remove duplicates
tfKeep = true(size(A,1),1);
priorRow = A(1,:);
for iRow = 2:size(A,1)
thisRow = A(iRow,:);
if isequal(thisRow, priorRow)
tfKeep(iRow) = false;
else
priorRow = thisRow;
end
end
A = A(tfKeep,:);
end
When I tested this on a matrix of your size on Matlab R2014b on OS X, it peaked at around 3 GB of memory used, vs about 1 GB to hold just the input matrix. Not bad.
>> m = rand([13146,13146]);
>> tic; rrunique(m); toc
Elapsed time is 17.435783 seconds.
You can use a sliding window on columns and use a window size that doesn't cause a memory problem. Here is a solution
function A = winuninque(A, winSz)
nCol = size(A,2);
I = zeros(size(A,1), 0);
for k=1:winSz:nCol
[~, ~, I] = unique([I A(:,k:min(k+winSz-1,end))], 'rows');
end
[~, I] = unique(I);
A = A(I, :);
end
In order to actually have some duplicate rows it's better to generate the matrix with some duplicates, otherwise it will be just sorting. Here is a comparison between different approaches:
>> A=repmat(rand(13146,13146), 2, 1);
>> A=A(randperm(end), :);
>> A=A(1:end/2,:);
>> tic; B=rrunique(A); toc
Elapsed time is 13.318752 seconds.
>> tic; C=winunique(A, 16); toc
Elapsed time is 6.606122 seconds.
>> tic; D=unique(A, 'rows'); toc
Elapsed time is 29.815333 seconds.
>> isequal(B,C,D)
ans =
1
>> size(D)
ans =
9880 13146
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With