Summing 2 distance matrices to get a third 'overall' distance matrix (ecological context)

I am an ecologist, mainly using the vegan R package.

I have 2 matrices (samples × species abundances) (see data below):

matrix 1: nrow = 6 replicates × 24 sites = 144, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates × 24 sites = 72, ncol = 10 species abundances (macro-invertebrates)

The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity among pairs of sites, considering both matrices. I see 2 options:

option 1: averaging fish and macro-invertebrate abundances over replicates (at the site scale), cbind-ing the two mean-abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and computing Bray-Curtis on the result.

option 2: for each assemblage, computing the Bray-Curtis dissimilarity among samples, then computing Euclidean distances among site centroids, and finally summing the 2 distance matrices.

In case I am not clear, I carried out these 2 operations in the R code below.

Please, could you tell me whether option 2 is correct and more appropriate than option 1?

Thank you in advance.

Pierre

Below are the R code examples.

GENERATING DATA

library(plyr);library(vegan)
set.seed(1) ## fix the RNG so the simulated example is reproducible

#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
  Habitat=paste("H",gl(2,12*6),sep=""),
  Site=paste("S",gl(24,6),sep=""),
  Replicate=rep(paste("R",1:6,sep=""),24))

summary(a1.env)

a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))

names(a1.bio)=paste("F",1:15,sep="")

a1.bio[1:72,]=2*a1.bio[1:72,]

#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site

a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]

summary(a2.env)

a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))

names(a2.bio)=paste("I",1:10,sep="")

a2.bio[1:36,]=0.5*a2.bio[1:36,]


#environmental data at the site scale

env=unique(a1.env[,c("Habitat","Site")])

env=env[order(env$Site),]

OPTION 1: averaging abundances over replicates and cbind-ing the two matrices

a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))

a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]

a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))

a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]

bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])

dist.mean=vegdist(sqrt(bio.mean),"bray")

OPTION 2: computing, for each assemblage, distances among site centroids, and summing the 2 distance matrices

a1.dist=vegdist(sqrt(a1.bio),"bray")

a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids

a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")

a2.dist=vegdist(sqrt(a2.bio),"bray")

a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids

a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")

Summing up the two distance matrices using Gavin Simpson's fuse() (from the analogue package):

dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))

Summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction):

dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
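
As a quick sanity check (my own addition, not from the thread): summing squared Euclidean distances is the same as taking the Euclidean distance on the cbind-ed centroid coordinates, so the line above can be verified directly:

## should be TRUE: sqrt(d1^2 + d2^2) equals the Euclidean distance
## computed on the concatenated centroid coordinates
all.equal(as.vector(dist.centroid),
          as.vector(dist(cbind(a1.coord.centroid,a2.coord.centroid))))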

and computing 'coord.centroid' below for further distance-based analyses (is this correct?)

## with add=TRUE, cmdscale() returns a list; the site coordinates are in $points
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)$points
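
One possible follow-up (my sketch, not part of the question): relating those coordinates to the environment with an RDA. This assumes the rows of coord.centroid are in the same alphabetical Site order as env, which holds here because both are sorted by the Site labels.

## constrained ordination of the fused centroid coordinates on Habitat
ord=rda(coord.centroid~Habitat,data=env)
anova(ord,permutations=999)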

COMPARING OPTIONS 1 AND 2

pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))

pco.centroid=cmdscale(dist.centroid)

comparison=procrustes(pco.centroid,pco.mean)

protest(pco.centroid,pco.mean)
Asked Jan 24 '14 by Pierre

2 Answers

An easier solution is just to flexibly combine the two dissimilarity matrices by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices, the fused dissimilarity matrix is

d.fused = (w * d.x) + ((1 - w) * d.y)

where w is a numeric scalar (length-1 vector) weight. If you have no reason to weight one set of dissimilarities more than the other, just use w = 0.5.
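
A minimal hand-rolled sketch of that formula (my illustration, not analogue code; fuse() below does this for you, with some extra rescaling that Jari Oksanen's answer discusses):

 library(vegan)
 set.seed(1)
 x <- matrix(rpois(80, 5), nrow = 8)
 y <- matrix(rpois(80, 5), nrow = 8)
 d.x <- vegdist(x, "bray")
 d.y <- vegdist(y, "bray")
 w <- 0.5
 ## arithmetic on "dist" objects keeps their attributes, so d.fused is a "dist" too
 d.fused <- (w * d.x) + ((1 - w) * d.y)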

I have a function to do this for you in my analogue package: fuse(). The example from ?fuse is

 library(analogue) ## for fuse()
 library(vegan)    ## for vegdist()

 train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
 train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                             ncol = 10))
 rownames(train1) <- rownames(train2) <- LETTERS[1:10]
 colnames(train1) <- colnames(train2) <- as.character(1:10)

 d1 <- vegdist(train1, method = "bray")
 d2 <- vegdist(train2, method = "jaccard")

 dd <- fuse(d1, d2, weights = c(0.6, 0.4))
 dd
 str(dd)

This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.

analogue works closely with vegan, so there won't be any issues running the two packages side by side.

Answered by Gavin Simpson

The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, that they satisfy certain metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.

This issue is related to the controversy between partial-Mantel-type analysis of dissimilarities and analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases the analysis of rectangular data is more robust and more powerful. By rectangular data I mean the normal sampling-units-times-species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form; these include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp and meandist.
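
As a rough sketch of the rectangular-data route (my illustration, reusing bio.mean and env from the question's option 1; in current vegan the permutational MANOVA function is called adonis2):

## hypothetical example: test the Habitat effect on the rectangular
## site-by-species data instead of on pre-combined dissimilarities
adonis2(sqrt(bio.mean) ~ Habitat, data = env, method = "bray")
## db-RDA on the same rectangular data
ord <- capscale(sqrt(bio.mean) ~ Habitat, data = env, distance = "bray")
anova(ord)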

The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is, the mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or otherwise handle the data and then compute dissimilarities from the transformed data.
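
A tiny numeric sketch of that point (my illustration):

## Bray-Curtis between site means vs. mean of replicate-level Bray-Curtis values
library(vegan)
set.seed(2)
s1 <- matrix(rpois(30, 5), nrow = 3) ## 3 replicates of site 1
s2 <- matrix(rpois(30, 8), nrow = 3) ## 3 replicates of site 2
vegdist(rbind(colMeans(s1), colMeans(s2)), "bray")        ## distance between the means
mean(as.matrix(vegdist(rbind(s1, s2), "bray"))[1:3, 4:6]) ## mean of the distances: generally differs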

If you want to combine dissimilarities, the analogue::fuse() style approach is most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If your dissimilarity measures are on the 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).

Finally, about the geometry of combining dissimilarities. Dissimilarity indices on the 0..1 scale are fractions of the type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, the result will not equal the same fraction computed from the averaged data. This is what I mean by destroying the geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences: their squares are additive, but the distances themselves are not.
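
In plain numbers (my illustration of the fraction argument):

## averaging two fractions with unequal denominators vs. the same
## fraction computed from the pooled data
(2/10 + 3/4) / 2   ## mean of the fractions: 0.475
(2 + 3) / (10 + 4) ## fraction of pooled numerators and denominators: ~0.357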

I demonstrate these things by showing the effect of adding together two dissimilarities (averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry-preserving addition of the distances of the data subsets will give the same result as analysing the complete data:

library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1]  50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the "dist" structure needs fuse() or a trick
## like the one below, where the dist skeleton dtmp is filled with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## fuse() rescales the dissimilarities, so they are no longer additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)

I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather go with the second group.

Answered by Jari Oksanen