R - How to avoid loops when comparing two datasets?

Question

Edit to simplify the question

I have two matrices:

mat1 : nrow=100 000 ; ncol=5
mat2 : nrow=500 000 ; ncol=5

Expected Results

Count the number of values shared between each row of mat1 and each row of mat2.

Proposal

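    Intersection <- function(matrix1, matrix2) {
      # One row per row of matrix1, one column per row of matrix2
      result <- matrix(0L, nrow = nrow(matrix1), ncol = nrow(matrix2))
      for (i in seq_len(nrow(matrix1))) {
        for (j in seq_len(nrow(matrix2))) {
          # Number of distinct values shared by row i of matrix1 and row j of matrix2
          result[i, j] <- length(intersect(matrix1[i, ], matrix2[j, ]))
        }
      }
      result
    }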

Question:

How can I vectorize this function to avoid the loops?

Data sample

Here is a data sample for testing a solution:

dput(matrix1)
structure(c(1L, 20L, 2L, 1L, 7L, 2L, 22L, 12L, 2L, 27L, 3L, 35L, 16L, 3L, 32L, 4L, 37L, 35L, 17L, 33L, 5L, 38L, 46L, 27L, 49L), .Dim = c(5L, 5L))

dput(matrix2)
structure(c(1, 14, 7, 1, 7, 2, 22, 12, 2, 27, 7, 35, 16, 3, 32, 14, 39, 35, 17, 32, 17, 38, 46, 20, 49), .Dim = c(5L, 5L))

IRTFM · Accepted Answer

The way to improve efficiency of processing is not to throw away loops but rather to examine the inner logic of the loops. In this case it appears you want to use the number of intersecting elements in TARGET's column-i with mat's column-j as an offset to pick elements in the "IF_n" columns and place that item in the (5+i)-th row and j-th column. We should be able to get rid of all those ifelse statements when the problem is described in that manner. (I often find that spending time restating the problem in the clearest possible natural language is the key to improving efficiencies.) There will be a bit of a modulo arithmetrickery involved in getting the 0 result to index the fifth column.
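To make the modulo step concrete, here is a minimal sketch of the offset arithmetic on its own. The layout of df is not shown in this thread, so which count lands on which "IF_n" column is an assumption; the snippet only shows how `%%` wraps the counts.

```r
# Possible intersection counts for two 5-element columns
n <- 0:5

# Column offset used in place of a chain of ifelse() calls
offset <- 2 + (n %% 5)
offset
# 2 3 4 5 6 2   (counts 0 and 5 wrap to the same column)
```

A single indexed lookup such as `df[i, offset]` then does the work that would otherwise need one ifelse branch per possible count.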

I also have a problem with the logic in asking for the length of the intersection of df$TARGET[i] with a mat-column. df$TARGET[i] can only ever be a single number, since you used vector indexing rather than matrix indexing. (df$TARGET is a matrix, so it should be df$TARGET[,i].)

This is my counter-proposal. I think it is both more in keeping with the desired outcome and probably at least 5 times faster, since you can completely eliminate all that ifelse folderol.
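The difference between the two indexing forms can be seen on a small stand-in matrix. `TARGET` below is a hypothetical 5 x 5 matrix made up for illustration, not the original data.

```r
# Hypothetical 5 x 5 matrix standing in for df$TARGET
TARGET <- matrix(1:25, nrow = 5)

single <- TARGET[2]     # vector indexing: one element of the underlying vector
column <- TARGET[, 2]   # matrix indexing: the whole second column (6 to 10)

other <- c(2, 7, 9)

# With a single element the intersection length can only be 0 or 1
n_single <- length(intersect(single, other))

# With a full column it counts every shared value
n_column <- length(intersect(column, other))
```

Here `n_single` is 1 (only the value 2 is compared) while `n_column` is 2 (both 7 and 9 appear in the column), which is why the vector-indexed version cannot produce the counts the loop is meant to use.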

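BDfunc <- function(df, mat) {
  for (i in 1:nrow(df)) {      # print(i)  (use for debugging)
    for (j in 1:ncol(mat)) {   # print(j)
      # Count the values shared by TARGET column i and mat column j,
      # then use that count (mod 5) as a column offset into df
      n <- length(intersect(df$TARGET[, i], mat[, j]))
      mat[5 + i, j] <- df[i, 2 + (n %% 5)]
    }
  }
  return(mat)
}
mat <- BDfunc(df, mat)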

> mat
          [,1]      [,2]      [,3]      [,4]      [,5]
 [1,] 1.000000 20.000000  2.000000  1.000000  7.000000
 [2,] 2.000000 22.000000 12.000000  2.000000 27.000000
 [3,] 3.000000 35.000000 16.000000  3.000000 32.000000
 [4,] 4.000000 37.000000 35.000000 17.000000 33.000000
 [5,] 5.000000 38.000000 46.000000 27.000000 49.000000
 [6,] 5.855105  2.216690  7.458434  3.120932  2.216690
 [7,] 6.381849  6.381849  6.630405  6.381849  6.630405
 [8,] 2.464372  2.464372  2.464372  5.993037  5.993037
 [9,] 1.614552  1.614552  1.614552  5.507400  1.614552
[10,] 2.088811  2.088811  2.088811  2.088811  5.974585
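For the simplified question as edited (counting the values shared by every row of matrix1 and every row of matrix2), one possible loop-free sketch follows. It is not part of the answer above: it swaps in a different technique, turning each matrix into a 0/1 membership table over the pooled values and multiplying the two tables. The helper name `membership` is made up for this illustration.

```r
matrix1 <- structure(c(1L, 20L, 2L, 1L, 7L, 2L, 22L, 12L, 2L, 27L, 3L, 35L,
                       16L, 3L, 32L, 4L, 37L, 35L, 17L, 33L, 5L, 38L, 46L,
                       27L, 49L), .Dim = c(5L, 5L))
matrix2 <- structure(c(1, 14, 7, 1, 7, 2, 22, 12, 2, 27, 7, 35, 16, 3, 32,
                       14, 39, 35, 17, 32, 17, 38, 46, 20, 49), .Dim = c(5L, 5L))

# Every distinct value that occurs in either matrix
vals <- sort(unique(c(matrix1, matrix2)))

# Hypothetical helper: one row per row of m, one column per entry of vals,
# with a 1 wherever that value occurs in that row
membership <- function(m, vals) {
  ind <- matrix(0L, nrow = nrow(m), ncol = length(vals))
  # A value repeated within a row just sets the same cell again,
  # which matches intersect() ignoring duplicates
  ind[cbind(as.vector(row(m)), match(as.vector(m), vals))] <- 1L
  ind
}

# counts[i, j] = number of distinct values shared by
# row i of matrix1 and row j of matrix2
counts <- tcrossprod(membership(matrix1, vals), membership(matrix2, vals))
```

At the sizes in the question (100 000 by 500 000 rows) the full result would have 5 x 10^10 cells, so in practice this would probably have to be computed in blocks of rows, and a sparse membership table may be preferable when the range of values is large.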
