 

Omitting the rows of a data frame whose elements are the same

Tags: r, rcpp

Let's say we have a data frame like this:

DataFrame ref = DataFrame::create( Named("sender") = sender , Named("receiver") = receiver);

The corresponding R code is as follows:

edge <- as.data.frame(edge) %>%
  set_colnames(c("time", "sender", "receiver"))
edge <- rbind(c(0,0,0), edge)
ref  <- data.frame(sender = rep(1:n, times = n),
                   receiver = rep(1:n, each = n)) %>%
  filter(sender != receiver) %>%
  mutate(teller = 1:(n*(n-1)))

Some rows in this data frame have the same value in both columns, e.g. sender 2 and receiver 2. I want to find these rows and remove them from the data frame. Then I want to add another column that numbers the rows from 1 to the number of rows of the new data frame.

Example:

Please see here

Mori asked Feb 04 '23 15:02



1 Answer

I think this question could be construed as a duplicate of this Stack Overflow question, but I answer here separately to demonstrate the point I made in the comments: if you're doing this for performance gains, Rcpp may not be the way to go for this particular task. There are many tasks where Rcpp would be my go-to for increased performance, but subsetting rows of data frames is not one of them.

The code is pretty easy to set up, following the approach from the answer I linked:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::DataFrame foo(Rcpp::DataFrame x) {
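    // Extract the two columns (integer columns are coerced to double here)
    Rcpp::NumericVector sender = x["sender"];
    Rcpp::NumericVector receiver = x["receiver"];
    // TRUE for the rows to keep, i.e. those where sender and receiver differ
    Rcpp::LogicalVector indices = sender != receiver;
    // Build a new data frame from the kept elements of each column
    return Rcpp::DataFrame::create(Rcpp::Named("sender") = sender[indices],
                                   Rcpp::Named("receiver") = receiver[indices]);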
}

However, we can see that this function is actually slower than base R (and data.table can slightly edge out base R):

library(dplyr)
library(Rcpp)
library(microbenchmark)
library(data.table)

sourceCpp("so.cpp")

for ( n in 10^(1:3) ) {
    ref  <- data.frame(sender = rep(1:n, times = n),  ## If you're using
                       receiver = rep(1:n, each = n)) ## data frames
    refDT <- setDT(ref) ## If you're using data.table (note: setDT() converts ref in place)
    cat("For n =", n, "(a data frame with", nrow(ref), "rows)\n")
    print(microbenchmark(base = ref[ref$sender != ref$receiver, ],
                         dplyr = ref %>% filter(sender != receiver),
                         rcpp = foo(ref),
                         data.table = refDT[sender != receiver]))
    cat("\n")
}
For n = 10 (a data frame with 100 rows)
Unit: microseconds
       expr     min       lq     mean   median       uq     max neval
       base 123.917 140.0025 160.7615 155.1905 170.7825 302.520   100
      dplyr 397.308 430.7595 478.0543 446.9185 492.5705 900.716   100
       rcpp 189.473 212.9530 238.8270 223.3305 240.7950 461.452   100
 data.table 122.436 135.9185 160.6607 154.0565 166.7825 460.739   100

For n = 100 (a data frame with 10000 rows)
Unit: microseconds
       expr     min       lq     mean   median       uq     max neval
       base 205.978 224.9760 250.7321 244.3315 265.5060 510.079   100
      dplyr 519.276 581.4535 629.2837 615.7095 662.8060 989.698   100
       rcpp 369.276 430.3510 463.1586 471.3195 486.4450 736.907   100
 data.table 198.012 221.8445 248.9371 246.2385 267.5325 341.935   100

For n = 1000 (a data frame with 1000000 rows)
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
       base  6.535990  6.892702  7.664697  7.203983  7.554144 11.42160   100
      dplyr  8.795884  9.239173 10.024997  9.618395  9.992066 15.04914   100
       rcpp 15.116928 15.598556 17.164895 16.216766 17.066418 30.45578   100
 data.table  6.624728  6.905202  7.543284  7.137171  7.482922 11.67061   100
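
The question also asks for a column that numbers the remaining rows (`teller` in the R code), which `foo` above does not add. A minimal sketch of the combined step (filter, then number the surviving rows) in plain C++ follows. It uses `std::vector` rather than Rcpp types so that it compiles on its own, and `EdgeTable` and `drop_self_pairs` are hypothetical names chosen for illustration, not part of Rcpp or the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain-C++ stand-in for the three-column data frame.
struct EdgeTable {
    std::vector<int> sender;
    std::vector<int> receiver;
    std::vector<int> teller;  // 1-based row number, like `teller` in the R code
};

// Keep only the rows where sender != receiver and number them 1..k,
// where k is the number of rows kept.
EdgeTable drop_self_pairs(const std::vector<int>& sender,
                          const std::vector<int>& receiver) {
    // Assumption: both columns have the same length, as in a data frame.
    assert(sender.size() == receiver.size());

    EdgeTable out;
    for (std::size_t i = 0; i < sender.size(); ++i) {
        if (sender[i] != receiver[i]) {
            out.sender.push_back(sender[i]);
            out.receiver.push_back(receiver[i]);
            // The next row number is the count of rows kept so far, plus one.
            out.teller.push_back(static_cast<int>(out.teller.size()) + 1);
        }
    }
    return out;
}
```

For example, `drop_self_pairs({1, 2, 1, 2}, {1, 1, 2, 2})` keeps the pairs (2, 1) and (1, 2) and numbers them 1 and 2. The same single-pass loop could fill the output columns inside an Rcpp function, although the benchmarks above suggest little reason to expect a speed gain over plain R subsetting.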
duckmayr answered May 13 '23 03:05
