Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Compare all the columns pairwise in matrix

I have a matrix with 41 rows and 6 columns. This is how the first part looks like.

      X13  X15  X17  X19  X21  X23 
 [1,] "7"  "6"  "5"  "8"  "1"  "8" 
 [2,] "7"  "6"  "5"  "8"  "14" "3" 
 [3,] "7"  "6"  "1"  "3"  "12" "3" 
 [4,] "7"  "6"  "1"  "5"  "6"  "14"
 [5,] "2"  "6"  "1"  "5"  "16" "3" 
 [6,] "2"  "3"  "5"  "5"  "2"  "3" 
 [7,] "7"  "5"  "5"  "17" "7"  "3" 
 [8,] "7"  "2"  "5"  "2"  "2"  "14"
 [9,] "2"  "2"  "10" "10" "2"  "3" 
[10,] "2"  "2"  "10" "5"  "2"  "6" 

My goal is, to compare all the columns with each other, and see, how many of the numbers are equal in the 2 columns. I tried to do it like this:

s <- sum(matrix[,1]==matrix[,2])

But since I need to compare all the possible pairs, it is not effective. It would be good to put this in a loop, but I have no idea how.

And I would like to get my result in a form of a 6x6 similarity matrix. Something like this:

      X13 X15 X17 X19 X21 X23
 X13   0   0   3   2   2   3
 X15   0   0   9  11   4   6
 X17   3   9   0   5   1   3
 X19   2  11   5   0   9  10
 X21   2   4   1   9   0   9
 X23   3   6   3  10   9   0

As you see, I would like to put zeros to the matrix when a column is compared to iteslf.

Since I am a beginner R user, this task semms really complicated to me. I need to use this comparison to 50 matrixes, so I would be glad if you could help me. I would appreciate any tips/suggestions. My english is not so good either, but I hope I could explain my problem well enough. :)

like image 250
Sielu Avatar asked Nov 12 '13 15:11

Sielu


3 Answers

Here is an entirely vectorised solution using expand.grid to compute indices and colSums and matrix to wrap up the result.

#  Some reproducible 6x6 sample data
set.seed(1)
m <- matrix( sample(10,36,repl=TRUE) , ncol = 6 )
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    3   10    7    4    3    5
#[2,]    4    7    4    8    4    6
#[3,]    6    7    8   10    1    5
#[4,]   10    1    5    3    4    2
#[5,]    3    3    8    7    9    9
#[6,]    9    2   10    2    4    7


#  Vector source for column combinations
n <- seq_len( ncol(m) )

#  Make combinations
id <- expand.grid( n , n )

#  Get result
out <- matrix( colSums( m[ , id[,1] ] == m[ , id[,2] ] ) , ncol = length(n) )
diag(out) <- 0
#    [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    0    1    1    0    2    0
#[2,]    1    0    0    1    0    0
#[3,]    1    0    0    0    1    0
#[4,]    0    1    0    0    0    0
#[5,]    2    0    1    0    0    1
#[6,]    0    0    0    0    1    0
like image 152
Simon O'Hanlon Avatar answered Nov 17 '22 12:11

Simon O'Hanlon


A non-vectorized, (but perhaps more memory-efficient) way of doing this:

# Fancy way.
similarity.matrix<-apply(matrix,2,function(x)colSums(x==matrix))
diag(similarity.matrix)<-0


# More understandable. But verbose.
similarity.matrix<-matrix(nrow=ncol(matrix),ncol=ncol(matrix))
for(col in 1:ncol(matrix)){
  matches<-matrix[,col]==matrix
  match.counts<-colSums(matches)
  match.counts[col]<-0 # Set the same column comparison to zero.
  similarity.matrix[,col]<-match.counts
}
like image 38
nograpes Avatar answered Nov 17 '22 13:11

nograpes


An approach using v_outer from the qdap package:

library(qdapTools) #Using Simon's data

x <- v_outer(m, function(x, y) sum(x==y))
diag(x) <- 0

##    V1 V2 V3 V4 V5 V6
## V1  0  1  1  0  2  0
## V2  1  0  0  1  0  0
## V3  1  0  0  0  1  0
## V4  0  1  0  0  0  0
## V5  2  0  1  0  0  1
## V6  0  0  0  0  1  0

EDIT I added benchmarks:

set.seed(1)
matrix <- m <- matrix( sample(10,36,repl=TRUE) , ncol = 6 )

MATRIX <- function(){
    n <- seq_len( ncol(m) )
    id <- expand.grid( n , n )
    out <- matrix( colSums( m[ , id[,1] ] == m[ , id[,2] ] ) , ncol = length(n) )
    diag(out) <- 0
    out
}

V_OUTER <- function(){
    x <- v_outer(m, function(x, y) sum(x==y))
    diag(x) <- 0
    x
}

APPLY <- function(){
    similarity.matrix<-apply(matrix,2,function(x)colSums(x==matrix))
    diag(similarity.matrix)<-0
    similarity.matrix
}

library(microbenchmark)
(op <- microbenchmark( 
    MATRIX(),
    V_OUTER(),
    APPLY() ,  
times=1000L))

Unit: microseconds
      expr     min      lq  median      uq      max neval
  MATRIX() 243.980 264.972 277.101 286.898 1719.519  1000
 V_OUTER() 203.861 223.921 234.650 243.280 1579.570  1000
   APPLY()  96.566 108.228 112.893 118.025 1470.409  1000
like image 2
Tyler Rinker Avatar answered Nov 17 '22 14:11

Tyler Rinker