Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bootstrap variables correlation in R

My intention was to write several functions aimed at finding the overall similarity between two covariance matrices, either by multiplying them with random vectors and correlating the response vectors or by bootstrapping one of the matrices to obtain the correlation coefficient distribution that can serve for comparison. But in both cases I got erroneous results. The observed between-matrix correlation was high up to 0.93, but the distribution only ranged up to 0.2 the most. This is the function`s code:

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR)
}  

In fact it is hard for me to belive that correlation coefficient between two original matrices is high, and the one between the first original matrix and the second re-sampled one is maximal 0.2 after 10000 bootstrap repetitions.

Any comments on the validity of the code?

like image 420
Fedja Blagojevic Avatar asked Nov 05 '22 12:11

Fedja Blagojevic


1 Answers

Sorry, I am not enough educated to catch up your goals about checking the correlation efficient between two covariance matrices, but I tried to apprehend your code per se.

If I am right, you are making up 10.000 different matrices from the same matrix (mat2) by reordering the cells all round, and recomputing the correlation between the covariance matrix of mat1 and the covariance matrix of the resampled array. Those are stored in the statSim variable.

You said the original correaltion efficient was high (statObs), but the maximum of statSim was low, which is strange. I think the problem is with your result list:

list( obs = statObs , sumFit = sum(statSim > statObs)/numR)

Where you return the original correaltion coefficient (obs), but not the written maximum with sumFit. There you might use eg. max(statSim). I see the point in returning sumFit for checking if the resampling did any improvement to the correlation efficient, but based on your code, I see no problem about the theory.

Updated function with max of simulated correlation coefficients:

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR, max=max(statSim))
}

What I had run:

> mat1 <- matrix(runif(25),5,5)
> mat2 <- mat1+0.2
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 1

$sumFit
[1] 0

$max
[1] 0.94463

And with random mat2:

> mat2 <- matrix(runif(25),5,5)
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 0.31144

$sumFit
[1] 0.9124

$max
[1] 0.9231

My answer might not be a real answer. If that would be the case, please give more details about the problem :)

like image 113
daroczig Avatar answered Nov 09 '22 03:11

daroczig