
What techniques exist in R to visualize a "distance matrix"?

I wish to present a distance matrix in an article I am writing, and I am looking for good visualization for it.

So far I have come across balloon plots (I used one here, but I don't think it will work in this case), heatmaps (here is a nice example, but they don't seem to allow showing the numbers in the table; correct me if I am wrong. Maybe half the table in colors and half with numbers would be cool), and lastly correlation ellipse plots (here is some code and an example; using a shape is cool, but I am not sure how to apply it here).

There are also various clustering methods, but they aggregate the data, which is not what I want: I want to present all of the data.

Example data:

nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist(nba[1:20, -1])

I am open to ideas.

Tal Galili asked Jun 20 '10 21:06




7 Answers

You could also use force-directed graph drawing algorithms to visualize a distance matrix, e.g.

nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist_m <- as.matrix(dist(nba[1:20, -1]))
dist_mi <- 1/dist_m # one over, as qgraph takes similarity matrices as input
library(qgraph)
jpeg('example_forcedraw.jpg', width=1000, height=1000, units='px')
qgraph(dist_mi, layout='spring', vsize=3)
dev.off()
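One detail of the "one over" step: the diagonal of a distance matrix is zero, so 1/dist_m is Inf there. Whether a given plotting function tolerates that is not something I would assume. A minimal base-R sketch of a more defensive conversion, using synthetic data instead of the NBA file (the names x, d and sim are illustrative only):

```r
# Synthetic stand-in for the real data: 5 observations, 3 variables (assumption)
set.seed(1)
x <- matrix(rnorm(15), nrow = 5)
d <- as.matrix(dist(x))

# Inverse distance as a similarity: larger distance -> smaller similarity
sim <- 1 / d

# 1/0 gives Inf on the diagonal; set self-similarity to 0 (no self-loops)
diag(sim) <- 0

all(is.finite(sim))
```

The resulting matrix is still symmetric and can be passed on in place of dist_mi.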

jmb answered Oct 03 '22 02:10


A Voronoi Diagram (a plot of a Voronoi Decomposition) is one way to visually represent a Distance Matrix (DM).

They are also simple to create and plot using R--you can do both in a single line of R code.

If you're not familiar with this aspect of computational geometry, the relationship between the two (VD & DM) is straightforward, though a brief summary might be helpful.

A distance matrix, i.e., a 2D matrix holding the distance between each point and every other point, is an intermediate output of a kNN computation (k-nearest neighbors, a machine learning algorithm that predicts the value of a given data point from the weighted average value of its 'k' closest neighbors, distance-wise, where 'k' is some integer, usually between 3 and 5).

kNN is conceptually very simple: each data point in your training set is in essence a 'position' in some n-dimensional space, so the next step is to calculate the distance between each point and every other point using some distance metric (e.g., Euclidean, Manhattan, etc.). While the training step (constructing the distance matrix) is straightforward, using it to predict the value of new data points is encumbered in practice by data retrieval: finding the closest 3 or 4 points from among several thousand or several million scattered in n-dimensional space.
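To make the distance-matrix step concrete, here is a minimal base-R sketch that builds the matrix and reads each point's k nearest neighbors off it by brute force. The points are synthetic, and the names pts, k and knn_idx are illustrative, not part of any package:

```r
set.seed(42)
pts <- matrix(runif(40), ncol = 2)   # 20 synthetic points in 2-D (assumption)

# Full symmetric distance matrix; method could also be "manhattan"
d <- as.matrix(dist(pts, method = "euclidean"))

k <- 3
# For each row, order the distances and skip the first entry,
# which is the point itself at distance 0 (assumes no duplicate points)
knn_idx <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))

head(knn_idx)   # one row per point, k neighbor indices per row
```

This scans every row in full, which is exactly the retrieval cost that becomes a problem with millions of points.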

Two data structures are commonly used to address that problem: kd-trees and Voronoi decompositions (aka "Dirichlet tessellations").

A Voronoi decomposition (VD) is uniquely determined by a distance matrix--i.e., there's a 1:1 map; so indeed it is a visual representation of the distance matrix, although again, that's not its purpose--its primary purpose is the efficient storage of the data used for kNN-based prediction.

Beyond that, whether it's a good idea to represent a distance matrix this way probably depends most of all on your audience. To most, the relationship between a VD and the antecedent distance matrix will not be intuitive. But that doesn't make it incorrect--if someone without any statistics training wanted to know if two populations had similar probability distributions and you showed them a Q-Q plot, they would probably think you haven't engaged their question. So for those who know what they are looking at, a VD is a compact, complete, and accurate representation of a DM.

So how do you make one?

A Voronoi decomp is constructed by selecting (usually at random) a subset of points from within the training set (this number varies by circumstances, but if we had 1,000,000 points, then 100 is a reasonable number for this subset). These 100 data points are the Voronoi centers ("VC").

The basic idea behind a Voronoi decomp is that rather than having to sift through the 1,000,000 data points to find the nearest neighbors, you only have to look at these 100; once you find the closest VC, your search for the actual nearest neighbors is restricted to just the points within that Voronoi cell. So, after selecting the VCs, the next step is to find, for each data point in the training set, the VC it is closest to. Finally, for each VC and its associated points, calculate the convex hull: conceptually, just the outer boundary formed by that VC's assigned points that are farthest from the VC. This convex hull around the Voronoi center forms a "Voronoi cell." A complete VD is the result of applying those three steps (select the VCs, assign each point to its nearest VC, compute the hulls) across your training set. This will give you a perfect tessellation of the surface (see the diagram below).
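The three steps just described can be sketched in base R. This follows the procedure as worded above (random centers, nearest-center assignment, convex hulls), not the internals of any package; the data are synthetic and much smaller than the 1,000,000 points in the example, and all variable names are illustrative:

```r
set.seed(1)
n <- 1000                                   # stand-in for the full training set
pts <- cbind(x = runif(n), y = runif(n))

# Step 1: pick a random subset of the points as Voronoi centers
n_centers <- 10
center_idx <- sample(n, n_centers)
centers <- pts[center_idx, ]

# Step 2: assign every point to its nearest center.
# d2[i, j] is the squared Euclidean distance from point i to center j
d2 <- outer(pts[, "x"], centers[, "x"], "-")^2 +
      outer(pts[, "y"], centers[, "y"], "-")^2
nearest <- max.col(-d2, ties.method = "first")   # column of the smallest distance

# Step 3: convex hull of the points assigned to each center
hulls <- lapply(seq_len(n_centers), function(j) {
  members <- which(nearest == j)
  members[chull(pts[members, , drop = FALSE])]   # hull vertices as row indices of pts
})

# Draw the points coloured by cell, the hull outlines, and the centers
plot(pts, col = nearest, pch = 20, cex = 0.5, asp = 1)
for (h in hulls) polygon(pts[h, , drop = FALSE], border = "black")
points(centers, pch = 4, cex = 1.5, lwd = 2)
```

A nearest-neighbor query would then compare a new point against the 10 centers first and search only inside the matching cell.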

To calculate a VD in R, use the tripack package. The key function is 'voronoi.mosaic', to which you pass the x and y coordinates separately (the raw data, not the DM); you can then pass the result of voronoi.mosaic straight to 'plot'.

library(tripack)
plot(voronoi.mosaic(runif(100), runif(100), duplicate="remove"))

[Figure: Voronoi mosaic of 100 random points, as produced by the code above]

doug answered Oct 03 '22 01:10


Tal, this is a quick way to overlay text on a heatmap. Note that this relies on image rather than heatmap, as the latter offsets the plot, making it more difficult to put text in the correct position.

To be honest, I think this graph shows too much information, making it a bit difficult to read... you may want to write only specific values.

Also, another quick option is to save your graph as a PDF, import it into Inkscape (or similar software), and manually add the text where needed.

Hope this helps

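nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")

# Full symmetric distance matrix for the first 20 players (name column dropped)
dst <- as.matrix(dist(nba[1:20, -1]))

n <- ncol(dst)  # not called 'dim', which would shadow base::dim

# Draw the matrix as a colour image, without the default axes
image(1:n, 1:n, dst, axes = FALSE, xlab = "", ylab = "")

# Label both axes with the player names
axis(1, 1:n, nba[1:20, 1], cex.axis = 0.5, las = 3)
axis(2, 1:n, nba[1:20, 1], cex.axis = 0.5, las = 1)

# Print each distance, rounded to one decimal, in its cell
text(expand.grid(1:n, 1:n), sprintf("%0.1f", dst), cex = 0.6)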
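The question's idea of "half the table in colors and half with numbers" can be built with the same image/text approach. This is only a sketch on synthetic data (the item names and the variables col_part and idx are made up for illustration): the upper triangle is coloured and the lower triangle carries the numbers.

```r
set.seed(2)
# Synthetic data: 8 items, 4 variables (assumption, stands in for the NBA table)
x <- matrix(rnorm(8 * 4), nrow = 8,
            dimnames = list(paste0("item", 1:8), NULL))
d <- as.matrix(dist(x))
n <- nrow(d)

# Keep only the upper triangle for colouring; image() leaves NA cells blank
col_part <- d
col_part[lower.tri(col_part, diag = TRUE)] <- NA

image(1:n, 1:n, col_part, axes = FALSE, xlab = "", ylab = "")
axis(1, 1:n, rownames(d), cex.axis = 0.6, las = 3)
axis(2, 1:n, rownames(d), cex.axis = 0.6, las = 1)

# Row/column positions of the lower triangle; image() puts row i at x = i
# and column j at y = j, so text() uses the same convention
idx <- which(lower.tri(d), arr.ind = TRUE)
text(idx[, "row"], idx[, "col"], sprintf("%0.1f", d[idx]), cex = 0.7)
```

Because the matrix is symmetric, no information is lost: every distance appears once as a colour and once as a number.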

nico answered Oct 03 '22 03:10


You may want to consider looking at a 2-D projection of your matrix (multidimensional scaling). Here is a link to how to do it in R.
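As a rough illustration of such a projection, here is a sketch using classical MDS via cmdscale from the stats package. The data are synthetic and the labels p1..p20 are invented for the example:

```r
set.seed(3)
# Synthetic data: 20 observations, 5 variables (assumption)
x <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(paste0("p", 1:20), NULL))
d <- dist(x)

# Classical (metric) MDS: 2-D coordinates whose distances approximate d
coords <- cmdscale(d, k = 2)

# Plot labels instead of dots; asp = 1 keeps distances comparable on both axes
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2", asp = 1)
text(coords, labels = rownames(coords), cex = 0.8)
```

Points that end up close together in the plot were close in the original distance matrix, to the extent that two dimensions can capture it.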

Otherwise, I think you are on the right track with heatmaps. You can add in your numbers without too much difficulty. For example, building off of Learn R:

library(ggplot2)
library(plyr)
library(arm)
library(reshape2)
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
nba.m <- melt(nba)
nba.m <- ddply(nba.m, .(variable), transform,
               rescale = rescale(value))
(p <- ggplot(nba.m, aes(variable, Name)) +
    geom_tile(aes(fill = rescale), colour = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(label = round(rescale, 1))))

Ian Fellows answered Oct 03 '22 02:10


In the book "Numerical Ecology with R" (Borcard et al. 2011), the authors use a function called coldiss (file coldiss.R); you can find it here: http://ichthyology.usm.edu/courses/multivariate/coldiss.R

It color-codes the distances and even orders the records by dissimilarity.

Another good option is the seriation package.
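The idea of ordering the records by dissimilarity before colouring can be sketched in base R. To be clear, this is neither coldiss nor the seriation package: it simply reuses the leaf order from hclust as a stand-in ordering, on synthetic data with two made-up groups.

```r
set.seed(4)
# Two synthetic groups of 10 observations, then shuffled (assumption)
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 4), ncol = 3))
x <- x[sample(nrow(x)), ]
d <- dist(x)

# Use the dendrogram leaf order as a simple ordering of the records
ord <- hclust(d, method = "average")$order
dm <- as.matrix(d)[ord, ord]
n <- nrow(dm)

# Reordered heat map: similar records end up next to each other
image(1:n, 1:n, dm, axes = FALSE, xlab = "", ylab = "", col = heat.colors(20))
axis(1, 1:n, ord, cex.axis = 0.6, las = 3)
axis(2, 1:n, ord, cex.axis = 0.6, las = 1)
```

With grouped data like this, the reordered matrix shows blocks along the diagonal that the unordered one hides.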

Reference: Borcard, D., Gillet, F. & Legendre, P. (2011) Numerical Ecology with R. Springer.

Jens answered Oct 03 '22 02:10


  1. A dendrogram based on a hierarchical cluster analysis can be useful: http://www.statmethods.net/advstats/cluster.html

  2. A 2-D or 3-D multidimensional scaling analysis in R: http://www.statmethods.net/advstats/mds.html

  3. If you want to go into 3+ dimensions, you might want to explore ggobi / rggobi: http://www.ggobi.org/rggobi/
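For the first option, a minimal sketch with hclust from the stats package, on synthetic data (the labels obs1..obs15 are invented). The cophenetic correlation at the end is one way to check how much of the original distance matrix the tree preserves:

```r
set.seed(5)
# Synthetic data: 15 observations, 4 variables (assumption)
x <- matrix(rnorm(15 * 4), nrow = 15,
            dimnames = list(paste0("obs", 1:15), NULL))
d <- dist(x)

# Hierarchical clustering on the distance matrix and its dendrogram
hc <- hclust(d, method = "complete")
plot(hc, main = "", xlab = "", sub = "")

# Correlation between the original distances and the distances implied by the tree
coph_cor <- cor(d, cophenetic(hc))
coph_cor
```

A value near 1 suggests the dendrogram is a faithful summary; a low value means the tree distorts the distances considerably.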

Jeromy Anglim answered Oct 03 '22 02:10


A solution using Multidimensional Scaling

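data = read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep = ",")
n = nrow(data)  # 50 players in this data set
# Squared Euclidean distances from the Gram matrix G = X X':
# d2[i, j] = G[i, i] + G[j, j] - 2 * G[i, j]
dst = tcrossprod(as.matrix(data[,-1]))
dst = matrix(rep(diag(dst), n), ncol = n, byrow = TRUE) + 
  matrix(rep(diag(dst), n), ncol = n, byrow = FALSE) - 2*dst

library(MASS)
mds = isoMDS(dst)  # non-metric MDS, 2 dimensions by default
#remove {type = "n"} to see dots
plot(mds$points, type = "n", pch = 20, cex = 3, col = adjustcolor("black", alpha = 0.3), xlab = "X", ylab = "Y") 
# rownames(data) are row numbers; use data$Name to label by player name instead
text(mds$points, labels = rownames(data), cex = 0.75)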

catastrophic-failure answered Oct 03 '22 03:10