
Compare Matrices in R efficiently

I have an array a containing a number of matrices. I need to efficiently check how many distinct matrices there are and at which indices (in ascending order) they occur in the array. My approach is the following: collapse each matrix into a single character string (column by column) and look at the frequency table, like this:

n <- 10 #observations
a <- array(round(rnorm(2*2*n),1),
           c(2,2,n))

paste_a <- apply(a, 3, paste, collapse=" ") # collapse each matrix into one string (column-major order)
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)]))) 
nr <- length(freq) #number of different matrices

However, as n grows this gets very inefficient (it is mainly paste() that slows down). Does anyone have a better solution?

Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF

Thank you very much.

Asked Dec 11 '17 by yrx1702

3 Answers

Since your actual data consists of the integers 0, 1, 2, 3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below refer to the real data set linked in the question.)

Base4Approach <- function() {
    # encode each matrix as a single base-4 number, then compare integers
    toBase4 <- sapply(1:dim(a)[3], function(x) {
        v <- as.vector(a[,,x])
        pows <- which(v > 0)       # positions of the non-zero entries
        coefs <- v[pows]           # the non-zero digits themselves
        sum(coefs*(4^pows))
    })
    myDupes <- which(duplicated(toBase4))
    a[,,-(myDupes)]                # drop the duplicates (assumes at least one duplicate)
}
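
To see what the encoding does on a single matrix, here is a minimal illustration (the matrix values below are hypothetical, just entries from 0 to 3):

m <- matrix(c(2, 0, 1, 3), 2, 2)   # hypothetical 2x2 matrix with entries in 0..3
v <- as.vector(m)                  # 2 0 1 3
pows  <- which(v > 0)              # positions 1, 3, 4
coefs <- v[pows]                   # 2, 1, 3
sum(coefs * (4^pows))              # 2*4 + 1*64 + 3*256 = 840

Each matrix collapses to one integer, so duplicated() only has to compare numbers.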

And since the question is about efficiency, let's benchmark:

MartinApproach <- function() {
    ### commented this out for comparison reasons
    # dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
    a <- a[,,!duplicated(a, MARGIN = 3)]
    nr <- dim(a)[3]
    a
}

identical(MartinApproach(), Base4Approach())
[1] TRUE

microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
            expr     min       lq      mean    median       uq      max neval
 Base4Approach() 291.658  303.525  339.2712  325.4475  352.981  636.361   100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495   100

The approach by @d.b doesn't really do the same thing as the previous two (it only identifies the duplicates and doesn't remove them).

DBApproach <- function() {
    a[, , 9] = a[, , 1]

    #Convert to list
    mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
    temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
    temp2 = unique(apply(temp, 1, function(x) sort(which(x))))

    #The indices in 'a' where the matrices are same
    temp2[lengths(temp2) > 1]
}

However, Base4Approach still dominates:

microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
            expr      min         lq       mean    median         uq       max neval
 Base4Approach()  298.764   324.0555   348.8534   338.899   356.0985   476.475   100
MartinApproach() 1012.601  1087.9450  1204.1150  1110.662  1162.9985  3224.299   100
    DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494   100


Update courtesy of @alexis_laz

As mentioned in the comments by @alexis_laz, we can do much better.

AlexisBase4Approach <- function() {
    # vectorised base-4 encoding: one colSums() call encodes every matrix at once
    toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
    myDupes <- which(duplicated(toBase4))
    a[,,-(myDupes)]
}

microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
                 expr       min        lq       mean     median         uq        max neval
      Base4Approach()  11.67992  10.55563   8.177654   8.537209   7.128652   5.288112   100
     MartinApproach()  39.60408  34.60546  27.930725  27.870019  23.836163  22.488989   100
         DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199   100
AlexisBase4Approach()   1.00000   1.00000   1.000000   1.000000   1.000000   1.000000   100

## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
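
For completeness, the same base-4 codes can also recover the outputs the question originally asked for (number of distinct matrices, their frequencies, and the index of each first occurrence). A minimal sketch, assuming a is the linked data set; the function name is hypothetical:

FreqFromBase4 <- function() {
    toBase4  <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
    firstIdx <- which(!duplicated(toBase4))                   # first occurrence of each distinct matrix
    freq     <- tabulate(match(toBase4, toBase4[firstIdx]))   # how often each distinct matrix occurs
    list(nr = length(firstIdx), indices = firstIdx, freq = freq)
}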
Answered by Joseph Wood


My first attempt was actually really slow, so here is a slightly changed version of your approach:

  dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
  a   <- a[,,!duplicated(a, MARGIN = 3)]
  nr  <- dim(a)[3] #number of different matrices
  idx <- dimnames(a)[[3]] # indices of left over matrices
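
A quick usage sketch on simulated data like in the question (the values are hypothetical; a duplicate is injected so there is something to drop):

set.seed(1)                               # hypothetical seed, only for reproducibility
n <- 10
a <- array(round(rnorm(2*2*n), 1), c(2, 2, n))
a[, , 9] <- a[, , 1]                      # make observation 9 a duplicate of observation 1

dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a   <- a[, , !duplicated(a, MARGIN = 3)]
dim(a)[3]                                 # 9 distinct matrices remain
dimnames(a)[[3]]                          # "1" "2" ... "8" "10"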
Answered by Martin Schmelzer



I don't know if this is exactly what you want, but here is a way to extract the indices at which matrices are identical. Some further processing may be necessary to get exactly what you want.

#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]

temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
    sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
    MARGIN = 1,
    FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9
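
If you also want the frequency of each distinct matrix and a representative index, a small follow-up on the temp list from above (it assumes the list form, i.e. at least one duplicate group, as in this example):

freq <- lengths(temp)          # how often each distinct matrix occurs
idx  <- sapply(temp, `[`, 1)   # first (smallest) index of each distinct matrix
nr   <- length(temp)           # number of distinct matrices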
Answered by d.b