I have a matrix with 41 rows and 6 columns. This is what the first part looks like:
X13 X15 X17 X19 X21 X23
[1,] "7" "6" "5" "8" "1" "8"
[2,] "7" "6" "5" "8" "14" "3"
[3,] "7" "6" "1" "3" "12" "3"
[4,] "7" "6" "1" "5" "6" "14"
[5,] "2" "6" "1" "5" "16" "3"
[6,] "2" "3" "5" "5" "2" "3"
[7,] "7" "5" "5" "17" "7" "3"
[8,] "7" "2" "5" "2" "2" "14"
[9,] "2" "2" "10" "10" "2" "3"
[10,] "2" "2" "10" "5" "2" "6"
My goal is to compare every pair of columns and count how many of the numbers are equal in the two columns. I tried to do it like this:
s <- sum(matrix[,1]==matrix[,2])
But since I need to compare all the possible pairs, this is not efficient. It would be good to put this in a loop, but I have no idea how.
And I would like to get my result in the form of a 6x6 similarity matrix, something like this:
X13 X15 X17 X19 X21 X23
X13 0 0 3 2 2 3
X15 0 0 9 11 4 6
X17 3 9 0 5 1 3
X19 2 11 5 0 9 10
X21 2 4 1 9 0 9
X23 3 6 3 10 9 0
As you see, I would like to put zeros in the matrix when a column is compared to itself.
Since I am a beginner R user, this task seems really complicated to me. I need to apply this comparison to 50 matrices, so I would be glad if you could help me. I would appreciate any tips/suggestions. My English is not so good either, but I hope I could explain my problem well enough. :)
Here is an entirely vectorised solution, using expand.grid to compute the index combinations, colSums to count the matches, and matrix to wrap up the result.
# Some reproducible 6x6 sample data
set.seed(1)
m <- matrix( sample(10,36,repl=TRUE) , ncol = 6 )
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 3 10 7 4 3 5
#[2,] 4 7 4 8 4 6
#[3,] 6 7 8 10 1 5
#[4,] 10 1 5 3 4 2
#[5,] 3 3 8 7 9 9
#[6,] 9 2 10 2 4 7
# Vector source for column combinations
n <- seq_len( ncol(m) )
# Make combinations
id <- expand.grid( n , n )
# Get result
out <- matrix( colSums( m[ , id[,1] ] == m[ , id[,2] ] ) , ncol = length(n) )
diag(out) <- 0
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 1 1 0 2 0
#[2,] 1 0 0 1 0 0
#[3,] 1 0 0 0 1 0
#[4,] 0 1 0 0 0 0
#[5,] 2 0 1 0 0 1
#[6,] 0 0 0 0 1 0
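Since the question mentions 50 matrices, the steps above can be wrapped into a function and mapped over a list with lapply. This is a sketch, not part of the original answer; the names count_matches and ms are made up for illustration:

```r
# Hypothetical helper wrapping the vectorised steps above
count_matches <- function(m) {
  n   <- seq_len(ncol(m))
  id  <- expand.grid(n, n)                # all (i, j) column index pairs
  out <- matrix(colSums(m[, id[, 1]] == m[, id[, 2]]), ncol = length(n))
  diag(out) <- 0                          # zero out the self-comparisons
  out
}

# Apply it to a list of matrices (stand-in for the 50 real ones):
ms <- replicate(3, matrix(sample(10, 36, replace = TRUE), ncol = 6),
                simplify = FALSE)
results <- lapply(ms, count_matches)
```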
A non-vectorized (but perhaps more memory-efficient) way of doing this:
# Fancy way.
similarity.matrix<-apply(matrix,2,function(x)colSums(x==matrix))
diag(similarity.matrix)<-0
# More understandable. But verbose.
similarity.matrix<-matrix(nrow=ncol(matrix),ncol=ncol(matrix))
for(col in 1:ncol(matrix)){
matches<-matrix[,col]==matrix
match.counts<-colSums(matches)
match.counts[col]<-0 # Set the same column comparison to zero.
similarity.matrix[,col]<-match.counts
}
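One small point worth noting about the apply()/colSums() route: if the input matrix has column names (like X13, X15, ... in the question), they are carried through to the rows and columns of the result automatically, so no extra labelling step is needed. A sketch, assuming made-up data with the question's column names:

```r
# Input with column names like in the question
mat <- matrix(sample(10, 36, replace = TRUE), ncol = 6,
              dimnames = list(NULL, paste0("X", c(13, 15, 17, 19, 21, 23))))

# colSums() returns a named vector and apply() names the result columns,
# so the similarity matrix ends up labelled X13, X15, ... on both margins
similarity.matrix <- apply(mat, 2, function(x) colSums(x == mat))
diag(similarity.matrix) <- 0
```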
An approach using v_outer from the qdapTools package:
library(qdapTools) #Using Simon's data
x <- v_outer(m, function(x, y) sum(x==y))
diag(x) <- 0
## V1 V2 V3 V4 V5 V6
## V1 0 1 1 0 2 0
## V2 1 0 0 1 0 0
## V3 1 0 0 0 1 0
## V4 0 1 0 0 0 0
## V5 2 0 1 0 0 1
## V6 0 0 0 0 1 0
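The same pairwise idea can be sketched in base R with outer(), which avoids the qdapTools dependency. This is an assumption on my part, not part of the answer above; outer() needs a vectorised function, so the pairwise counter (here named count_pair, a made-up name) is wrapped with Vectorize:

```r
# Same sample data as Simon's answer
set.seed(1)
m <- matrix(sample(10, 36, replace = TRUE), ncol = 6)

# outer() calls the function on whole index vectors, so vectorise the
# scalar pairwise match count first
count_pair <- Vectorize(function(i, j) sum(m[, i] == m[, j]))
x <- outer(seq_len(ncol(m)), seq_len(ncol(m)), count_pair)
diag(x) <- 0
```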
EDIT: I added benchmarks:
set.seed(1)
matrix <- m <- matrix( sample(10,36,repl=TRUE) , ncol = 6 )
MATRIX <- function(){
n <- seq_len( ncol(m) )
id <- expand.grid( n , n )
out <- matrix( colSums( m[ , id[,1] ] == m[ , id[,2] ] ) , ncol = length(n) )
diag(out) <- 0
out
}
V_OUTER <- function(){
x <- v_outer(m, function(x, y) sum(x==y))
diag(x) <- 0
x
}
APPLY <- function(){
similarity.matrix<-apply(matrix,2,function(x)colSums(x==matrix))
diag(similarity.matrix)<-0
similarity.matrix
}
library(microbenchmark)
(op <- microbenchmark(
MATRIX(),
V_OUTER(),
APPLY() ,
times=1000L))
Unit: microseconds
expr min lq median uq max neval
MATRIX() 243.980 264.972 277.101 286.898 1719.519 1000
V_OUTER() 203.861 223.921 234.650 243.280 1579.570 1000
APPLY() 96.566 108.228 112.893 118.025 1470.409 1000