R - How to avoid loops when comparing two datasets?

Edited to simplify the question

I have two matrices:

  • mat1 : nrow=100 000 ; ncol=5
  • mat2 : nrow=500 000 ; ncol=5

Expected Results

For each row of mat1 and each row of mat2, count how many values the two rows have in common:

Proposal

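   Intersection <- function(matrix1, matrix2){
        # one row per row of matrix1, one column per row of matrix2
        Intersection = matrix(nrow=nrow(matrix1), ncol=nrow(matrix2))
          for(i in 1:nrow(matrix1)) {
            for(j in 1:nrow(matrix2)) {
            # number of distinct values shared by row i of matrix1 and row j of matrix2
            Intersection[i,j] = length(intersect(matrix1[i,], matrix2[j,]))
           }
         }
    return(Intersection) }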

Question:

How can this function be vectorized to avoid the loops?

Data sample

Here is a sample of the data to experiment with:

dput(matrix1)
structure(c(1L, 20L, 2L, 1L, 7L, 2L, 22L, 12L, 2L, 27L, 3L, 35L, 16L, 3L, 32L, 4L, 37L, 35L, 17L, 33L, 5L, 38L, 46L, 27L, 49L), .Dim = c(5L, 5L))

dput(matrix2)
structure(c(1, 14, 7, 1, 7, 2, 22, 12, 2, 27, 7, 35, 16, 3, 32, 14, 39, 35, 17, 32, 17, 38, 46, 20, 49), .Dim = c(5L, 5L))
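
One possible way to get the same counts without explicit R-level loops, sketched on the sample data above: turn each matrix into a 0/1 indicator matrix (one column per distinct value) and multiply the two. The function name `count_shared` and the helper `indicator` are made up for this illustration; only base R functions are used. This is a sketch, not a tested solution for the full-size data.

```r
matrix1 <- structure(c(1L, 20L, 2L, 1L, 7L, 2L, 22L, 12L, 2L, 27L, 3L, 35L,
                       16L, 3L, 32L, 4L, 37L, 35L, 17L, 33L, 5L, 38L, 46L,
                       27L, 49L), .Dim = c(5L, 5L))
matrix2 <- structure(c(1, 14, 7, 1, 7, 2, 22, 12, 2, 27, 7, 35, 16, 3, 32,
                       14, 39, 35, 17, 32, 17, 38, 46, 20, 49), .Dim = c(5L, 5L))

# Hypothetical helper: counts distinct shared values for every pair of rows.
count_shared <- function(matrix1, matrix2) {
  # all distinct values that occur in either matrix
  values <- sort(unique(c(matrix1, matrix2)))

  # 0/1 matrix: cell [r, k] is 1 if row r of m contains values[k]
  indicator <- function(m) {
    ind <- matrix(0L, nrow = nrow(m), ncol = length(values))
    ind[cbind(as.vector(row(m)), match(as.vector(m), values))] <- 1L
    ind
  }

  # indicator(matrix1) %*% t(indicator(matrix2)):
  # cell [i, j] is the number of values present in both rows
  tcrossprod(indicator(matrix1), indicator(matrix2))
}

result <- count_shared(matrix1, matrix2)
```

Because the indicator is 0/1, a value repeated within a row is counted once, which matches what `length(intersect(...))` does. Note that with 100 000 and 500 000 rows the full result would have 5 × 10^10 cells, so in practice it would have to be computed and consumed in blocks of rows rather than held in memory at once.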

sdata asked Nov 01 '22 14:11


1 Answer

The way to improve processing efficiency is not to throw away loops but to examine the inner logic of the loops. In this case it appears you want to use the number of elements shared by column i of TARGET and column j of mat as an offset to pick an element from the "IF_n" columns, and to place that item in the (5+i)-th row and j-th column. Described in that manner, we should be able to get rid of all those ifelse statements. (I often find that spending time restating the problem in the clearest possible natural language is the key to improving efficiency.) There will be a bit of modulo arithmetrickery involved in getting the 0 result to index the fifth column.
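
To make the "count as an offset" idea concrete, here is a small sketch of the index expression `2 + (n %% 5)` that the counter-proposal below uses, where `n` is the number of shared elements. The vector name `n_shared` is made up for this illustration, and the original data's column layout is not shown here, so the offset 2 is simply taken from that code.

```r
# possible intersection counts for two length-5 vectors
n_shared <- 0:5

# column picked by the counter-proposal's expression for each count
col_index <- 2 + (n_shared %% 5)

# one row per count, showing which column it maps to
data.frame(n_shared, col_index)
```

A single computed index like this replaces a chain of nested ifelse tests; note that counts of 0 and 5 wrap around to the same column.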

I also have a problem with the logic of asking for the length of the intersection of df$TARGET[i] with a column of mat. df$TARGET[i] can only ever be a single number, since you used vector indexing rather than matrix indexing. (df$TARGET is a matrix, so it should be df$TARGET[,i].)
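
A minimal illustration of the difference between the two indexing forms, using a made-up 2 x 3 matrix `m` rather than the original df$TARGET:

```r
m <- matrix(1:6, nrow = 2)    # hypothetical matrix; columns are (1,2), (3,4), (5,6)
other <- c(3, 5, 6)           # hypothetical vector to intersect with

single  <- m[3]               # vector indexing: one element, in column-major order
column3 <- m[, 3]             # matrix indexing: the whole third column

n_single <- length(intersect(single, other))    # can never exceed 1
n_column <- length(intersect(column3, other))   # counts shared values in the column
```

With vector indexing the intersection can only have length 0 or 1, which is why the count computed from df$TARGET[i] could not have been what was intended.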

This is my counter-proposal. I think it is both more in keeping with the desired outcome and probably at least 5 times faster, since you can completely eliminate all that ifelse folderol.

BDfunc <- function(df, mat){
  for (i in 1:nrow(df)) {      # print(i)  (use for debugging)
    for (j in 1:ncol(mat)) {   # print(j)
      mat[5+i, j] <- df[i, 2 + (length(intersect(df$TARGET[,i], mat[,j])) %% 5)]
    }
  }
  return(mat)
}
mat <- BDfunc(df, mat)

> mat
          [,1]      [,2]      [,3]      [,4]      [,5]
 [1,] 1.000000 20.000000  2.000000  1.000000  7.000000
 [2,] 2.000000 22.000000 12.000000  2.000000 27.000000
 [3,] 3.000000 35.000000 16.000000  3.000000 32.000000
 [4,] 4.000000 37.000000 35.000000 17.000000 33.000000
 [5,] 5.000000 38.000000 46.000000 27.000000 49.000000
 [6,] 5.855105  2.216690  7.458434  3.120932  2.216690
 [7,] 6.381849  6.381849  6.630405  6.381849  6.630405
 [8,] 2.464372  2.464372  2.464372  5.993037  5.993037
 [9,] 1.614552  1.614552  1.614552  5.507400  1.614552
[10,] 2.088811  2.088811  2.088811  2.088811  5.974585
IRTFM answered Nov 15 '22 07:11