
Return indices of rows whose elements (columns) all match a reference vector

Tags: r

The following code

  c <- NULL
  for (a in 1:4){
    b <- seq(from = a, to = a + 5)
    c <- rbind(c,b)
    }
  c <- rbind(c,c); rm(a,b)

produces this matrix:

> c
  [,1] [,2] [,3] [,4] [,5] [,6]
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9

How can I return row indices for rows matching a specific input?

For example, with a search term of,

z <- c(3,4,5,6,7,8)

I need the following returned,

[1] 3 7

I will apply this to a fairly large data frame of test data that has a time step column, to reduce the data by accumulating the time steps of matching rows.


The question was answered well by others. Because of my dataset size (9.5M rows), I came up with an efficient approach that takes a few steps.

1) Sort the big data frame 'dc', which holds the time steps to accumulate in column 1, by the columns to match (2 through 8).

dc <- dc[order(dc[,2],dc[,3],dc[,4],dc[,5],dc[,6],dc[,7],dc[,8]),]

2) Create a new data frame with unique entries (excluding column 1).

dcU <- unique(dc[,2:8])

3) Write an Rcpp (C++) function that loops over the unique data frame while walking through the original data frame. It accumulates time while the rows are equal and moves on to the next unique row as soon as an unequal row is found.

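  library(Rcpp)
  getTsrc <-
    '
  NumericVector getT(NumericMatrix dc, NumericMatrix dcU)
  {
  // dc must be sorted so that equal rows are adjacent and follow the order of dcU
  int k = 0;
  int n = dcU.nrow();
  int m = dc.nrow();
  NumericVector tU(n);  // accumulated time per unique row, starts at 0
  for (int i = 0; i<n; i++)
    {
    // k<m keeps the scan from reading past the last row of dc
    while ((k<m)&&
           (dcU(i,0)==dc(k,1))&&(dcU(i,1)==dc(k,2))&&(dcU(i,2)==dc(k,3))&&
           (dcU(i,3)==dc(k,4))&&(dcU(i,4)==dc(k,5))&&(dcU(i,5)==dc(k,6))&&
           (dcU(i,6)==dc(k,7)))
      {
      tU[i] += dc(k,0);
      k++;
      }
    }
  return tU;
  }
    '
  cppFunction(getTsrc)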

4) Convert function inputs to matrices.

  dc1 <- as.matrix(dc)
  dcU1 <- as.matrix(dcU)

5) Run the function and time it (returns time vector matching unique data frame)

  pt <- proc.time()
  t <- getT(dc1, dcU1)
  print(proc.time() - pt)

   user  system elapsed 
   0.18    0.03    0.20 

6) Self high-five and more coffee.
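
For cross-checking on small inputs, steps 1 to 5 can be approximated in base R without Rcpp. This is only a sketch on made-up data, not the approach timed above: the toy data frame and its column names `t`, `a`, `b` are hypothetical, and building a text key with `paste` assumes the matching columns convert to strings unambiguously.

```r
# Toy stand-in for dc: column 1 is the time step, the remaining columns
# are the values to match on (hypothetical data, already sorted).
dc <- data.frame(t = c(0.5, 0.5, 1.0, 2.0, 0.25),
                 a = c(1, 1, 2, 2, 3),
                 b = c(4, 4, 5, 5, 6))

# One text key per row, built from the matching columns only.
key <- do.call(paste, c(dc[, -1], sep = "\r"))

# Sum the time steps within each key. reorder = FALSE keeps groups in
# first-seen order, which lines up with the rows of unique(dc[, -1]).
tU  <- rowsum(dc$t, group = key, reorder = FALSE)[, 1]
dcU <- unique(dc[, -1])

# Accumulated time next to each unique row.
result <- cbind(t = unname(tU), dcU)
print(result)
```

Comparing `result$t` with the vector returned by getT on the same small input is a quick way to confirm both agree before running on the full data.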

Scott Smith asked Dec 08 '15 14:12


1 Answer

You can use apply.

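Here we call apply on c across rows (the 1), applying function(x) all(x == z) to each row. That function is TRUE only when every element of the row equals the corresponding element of z.

Wrapping the result in which then returns the integer positions of the matching rows.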

which(apply(c, 1, function(x) all(x == z)))
b b 
3 7
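
A variant that avoids the per-row function call is to compare the transposed matrix against z in one step. This is a sketch, not part of the original answer: it rebuilds the example matrix under the name `m` (an assumption, chosen so that base `c` is not masked) and relies on R recycling z down each column of `t(m)`, which only lines up when `length(z)` equals `ncol(m)`.

```r
# Rebuild the example matrix from the question without row names.
m <- do.call(rbind, lapply(1:4, function(a) seq(from = a, to = a + 5)))
m <- rbind(m, m)
z <- c(3, 4, 5, 6, 7, 8)

# Each column of t(m) is one row of m, so z is recycled once per row.
# colSums counts the mismatching elements in each original row.
mismatches <- colSums(t(m) != z)

# Rows with zero mismatches are exact matches.
idx <- which(mismatches == 0)
print(idx)
```

Because `m` has no row names here, `idx` is a plain integer vector. With the named matrix from the question, `unname()` would drop the `b` labels.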

EDIT: If your real data has problems with this, and has only 9 columns (not too much typing), you could try a fully vectorized solution:

which(c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4] & c[,5]==z[5] & c[,6]==z[6])
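
If typing out every column is impractical, the same column-by-column comparison can be generated for any number of columns. The following is a sketch under the assumption that z has one entry per column of the matrix; the names `m` and `col_matches` are introduced here for illustration only.

```r
# Example matrix from the question, named m to avoid masking base c.
m <- rbind(1:6, 2:7, 3:8, 4:9)
m <- rbind(m, m)
z <- c(3, 4, 5, 6, 7, 8)

# One logical vector per column: TRUE where that column equals z[j].
col_matches <- lapply(seq_along(z), function(j) m[, j] == z[j])

# Combine the per-column results with elementwise AND, then take positions.
idx <- which(Reduce(`&`, col_matches))
print(idx)
```

Each comparison is still vectorized over all rows, so the only R-level loop runs once per column rather than once per row.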
jeremycg answered Sep 19 '22 04:09