Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

permuting the rows and columns of a matrix for clustering [closed]

i have a distance matrix that is 1000x1000 in dimension and symmetric with 0s along the diagonal. i want to form groupings of distances (clusters) by simultaneously reordering the rows and columns of the matrix. this is like reordering a matrix before you visualize its clusters with a heatmap. i feel like this should be an easy problem, but i am not having much luck finding code that does the permutations online. can anyone help?

like image 906
user439463 Avatar asked Sep 04 '10 05:09

user439463


2 Answers

Here is one approach that came to mind:

  1. "Sparsify" the matrix so that only "sufficiently close" neighbors have a nonzero value in the matrix.
  2. Use a Cuthill-McKee algorithm to compress the bandwidth of the sparse matrix.
  3. Do a symmetric reordering of the original matrix using the results from Step 2.

Example

I will use Octave (everything I am doing should also work in Matlab) since it has a Reverse Cuthill-McKee (RCM) implementation built in.

First, we need to generate a distance matrix. This function creates a random set of points and their distance matrix:

function [x, y, A] = make_rand_dist_matrix(n)
  x = rand(n, 1);
  y = rand(n, 1);
  A = sqrt((repmat(x, 1, n) - repmat(x', n, 1)).^2 +
       (repmat(y, 1, n) - repmat(y', n, 1)).^2);
end

Let's use that to generate and visualize a 100-point example.

[x, y, A] = make_rand_dist_matrix(100);
surf(A);

Viewing the surface plot from above gets the image below (yours will be different, of course).

Original distance matrix.

Warm colors represent greater distances than cool colors. Row (or column, if you prefer) i in the matrix contains the distances between point i and all points. The distance between point i and point j is in entry A(i, j). Our goal is to reorder the matrix so that the row corresponding to point i is near rows corresponding to points a short distance from i.

A simple way to sparsify A is to make all entries greater than some threshold zero, and that is what is done below, although more sophisticated approaches may prove more effective.

B = A < 0.2;   % sparsify A -- only values less than 0.2 are nonzeros in B
p = symrcm(B); % compute reordering by Reverse Cuthill-McKee
surf(A(p, p)); % visualize reordered distance matrix

Reordered distance matrix.

The matrix is now ordered in a way that brings nearby points closer together in the matrix. This result is not optimal, of course. Sparse matrix bandwidth compression is computed using heuristics, and RCM is a very simple approach. As I mentioned above, more sophisticated approaches for producing the sparse matrix may give better results, and different algorithms may also yield better results for the problem.

Just for Fun

Another way to look at what happened is to plot the points and connect a pair of points if their corresponding rows in the matrix are adjacent. Your goal is to have the lines connecting pairs of points that are near each other. For a more dramatic effect, we use a larger set of points than above.

[x, y, A] = make_rand_dist_matrix(2000);
plot(x, y);   % plot the points in their initial, random order

Data points and their original ordering.

Clearly, connections are all over the place and are occurring over a wide variety of distances.

B = A < 0.2;     % sparsify A
p = symrcm(B);
plot(x(p), y(p)) % plot the reordered points

Reordered data points.

After reordering, the connections tend to be over much smaller distances and much more orderly.

like image 56
David Alber Avatar answered Nov 07 '22 20:11

David Alber


Two Matlab functions do this: symrcm and symamd. Note that there is no unique solution to this problem. Clustering is another approach.

like image 25
cyborg Avatar answered Nov 07 '22 21:11

cyborg