
Faster way to compare rows in a data frame

Tags:

r

Consider the data frame below. I want to compare each row with every row below it and keep the pairs of rows that are equal in more than 3 values.

I wrote the code below, but it is very slow on a large data frame.

How can I do this faster?

data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9), nrow=5, byrow=TRUE))
rownames(data) <- c("sample_1","sample_2","sample_3","sample_4","sample_5")

>data
          V1 V2 V3 V4 V5
sample_1  10 11 10 13  9
sample_2  10 11 10 14  9
sample_3  10 10  8 12  9
sample_4  10 11 10 13  9
sample_5  13 13 10 13  9

output <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
# Compare each row i with every row j below it
for (i in seq_len(nrow(data) - 1)) {
    sample <- data[i, ]
    for (j in (i + 1):nrow(data)) {
        # Count the columns in which rows i and j agree
        matches <- 0
        for (V in 1:ncol(data)) {
            if (data[j, V] == sample[, V]) {
                matches <- matches + 1
            }
        }
        # Keep pairs that agree in more than 3 values
        if (matches > 3) {
            duplicate <- data[j, ]
            pair <- cbind(rownames(sample), rownames(duplicate), matches)
            output[dfrow, ] <- pair
            dfrow <- dfrow + 1
        }
    }
}

>output
   sample    duplicate    matches
1 sample_1   sample_2     4
2 sample_1   sample_4     5
3 sample_2   sample_4     4
vitor asked Nov 01 '13 03:11



1 Answer

Here is an Rcpp solution. I run the loops twice: first to count the hits, which gives the necessary size of the result matrix, and then to fill it. There is probably a better way. Note that if the result matrix gets too big (i.e., there are too many hits), this will throw an error. Also, obviously, this will only work with integers. If your matrix is numeric, you'll have to deal with floating-point precision.

library(Rcpp)
library(inline)

#C++ code:
body <- '
const IntegerMatrix        M(as<IntegerMatrix>(MM));
const int                  m=M.ncol(), n=M.nrow();
long                        count1;
int                         count2;
count1 = 0;
for (int i=0; i<(n-1); i++)
{
   for (int j=(i+1); j<n; j++)
   {
     count2 = 0;
     for (int k=0; k<m; k++) {
        if (M(i,k)==M(j,k)) count2++;
     }
     if (count2>3) count1++;
   } 
}
IntegerMatrix              R(count1,3);
count1 = 0;
for (int i=0; i<(n-1); i++)
{
   for (int j=(i+1); j<n; j++)
   {
     count2 = 0;
     for (int k=0; k<m; k++) {
        if (M(i,k)==M(j,k)) count2++;
     }
     if (count2>3) {
        count1++;
        R(count1-1,0) = i+1;
        R(count1-1,1) = j+1;
        R(count1-1,2) = count2;
     }
   } 
}
return  wrap(R);
'

fun <- cxxfunction(signature(MM = "matrix"), 
                     body,plugin="Rcpp")

#with your data
fun(as.matrix(data))
#      [,1] [,2] [,3]
# [1,]    1    2    4
# [2,]    1    4    5
# [3,]    2    4    4
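
The matrix above holds row indices rather than sample names. A minimal sketch of turning it back into the layout the question asked for might look like the following. It assumes the question's example data and hard-codes the index matrix printed above (as `idx`, a name chosen here for illustration) so that it runs without compiling anything.

```r
# Rebuild the question's example data so the sketch is self-contained.
data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,
                               10,11,10,13,9,13,13,10,13,9),
                             nrow = 5, byrow = TRUE))
rownames(data) <- paste0("sample_", 1:5)

# Index matrix as printed above; hard-coded here instead of calling fun().
idx <- matrix(c(1, 2, 4,
                1, 4, 5,
                2, 4, 4), ncol = 3, byrow = TRUE)

# Translate row indices into row names; column names follow the question.
output <- data.frame(sample    = rownames(data)[idx[, 1]],
                     duplicate = rownames(data)[idx[, 2]],
                     matches   = idx[, 3],
                     stringsAsFactors = FALSE)
print(output)
```

In practice `idx` would simply be the return value of `fun(as.matrix(data))`.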

#Benchmarks
set.seed(42)
mat1 <- matrix(sample(1:10,250*26,TRUE),ncol=26)
mat2 <- matrix(sample(1:10,2500*26,TRUE),ncol=26)
mat3 <- matrix(sample(1:10,10000*26,TRUE),ncol=26)
mat4 <- matrix(sample(1:10,25000*26,TRUE),ncol=26)
library(microbenchmark)
microbenchmark(
  fun(mat1),
  fun(mat2),
  fun(mat3),
  fun(mat4),
  times=3
  )
# Unit: milliseconds
#      expr          min           lq       median           uq          max neval
# fun(mat1)     2.675568     2.689586     2.703603     2.732487     2.761371     3
# fun(mat2)   272.600480   274.680815   276.761151   276.796217   276.831282     3
# fun(mat3)  4623.875203  4643.634249  4663.393296  4708.067638  4752.741979     3
# fun(mat4) 29041.878164 29047.151348 29052.424532 29235.839275 29419.254017     3
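
If compiling C++ is not an option, the same pairwise count can be sketched in plain R by vectorising only the inner column loop. This is not part of the Rcpp solution above and has not been benchmarked here; `count_matches`, `threshold` and `tol` are names made up for this sketch. The `tol` argument is one possible way to handle the floating-point caveat: two values count as equal when they differ by at most `tol`, and the default `tol = 0` gives exact equality.

```r
# Hypothetical helper: for each row i, count matching columns against all
# rows below it in one vectorised step, and keep pairs above the threshold.
count_matches <- function(m, threshold = 3, tol = 0) {
  m <- as.matrix(m)
  n <- nrow(m)
  res <- vector("list", max(n - 1, 0))
  for (i in seq_len(max(n - 1, 0))) {
    below <- m[(i + 1):n, , drop = FALSE]
    # Repeat row i so it lines up element-wise with the rows below it
    ref <- matrix(m[i, ], nrow(below), ncol(m), byrow = TRUE)
    hits <- unname(rowSums(abs(below - ref) <= tol))
    keep <- which(hits > threshold)
    res[[i]] <- cbind(row = rep(i, length(keep)),
                      other = i + keep,
                      matches = hits[keep])
  }
  do.call(rbind, res)
}

# The question's example data
data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,
                               10,11,10,13,9,13,13,10,13,9),
                             nrow = 5, byrow = TRUE))
res <- count_matches(data)
print(res)

# Non-integer values: 0.1 + 0.2 is not exactly 0.3 in double precision,
# so a small tolerance changes the match count for the first column.
x <- rbind(c(0.1 + 0.2, 1, 2, 3, 4),
           c(0.3,       1, 2, 3, 4))
exact <- count_matches(x)
fuzzy <- count_matches(x, tol = 1e-9)
```

The outer loop over rows remains, so the work is still quadratic in the number of rows; only the per-pair column comparison is vectorised.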
Roland answered Oct 25 '22 13:10