Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create flag indicating if year variable is in the range of start:end variables in data.table

Tags:

r

data.table

I have a data.table with 3 date variables: year, start, end.

test <- data.table(year=2001:2003,start=c(2003,2002,2000),end=c(2003,2004,2002),x_desired=c(F,T,F))

O want to create a new variable x, indicating, for each row, if year is in the range defined by start and end. The correct desired result is in the variable x_desired.

I imagined this could be done with:

test[,x:=(year %in% start:end)]

but the result is clearly not correct. I wanted the ranges to be defined row-by-row, but don't know how to express that.

like image 634
LucasMation Avatar asked May 27 '19 15:05

LucasMation


People also ask

Is data table DT == true?

data. table(DT) is TRUE. To better description, I put parts of my original code here. So you may understand where goes wrong.

What is data table in R?

data.table is an R package that provides an enhanced version of data.frame s, which are the standard data structure for storing data in base R. In the Data section above, we already created a data.table using fread() . We can also create one using the data.table() function.


1 Answers

Another approach

#first, create a x-column with all FALSE
DT[, x := FALSE ]
#update the x-column subset where year is between start and end to TRUE
DT[ year %between% list(start,end), x := TRUE] 

Should run fast... Benchmarks will follow soon

update: benchmarks on a data.table of 1M rows

n = 1000000
set.seed(123)
dt <- data.table(year =sample( 2001:2003, n, replace = TRUE),
                 start=sample( c(2003,2002,2000), n, replace = TRUE),
                 end  =sample( c(2003,2004,2002), n, replace = TRUE) )

microbenchmark::microbenchmark( 
  wimpel = {
    DT <- copy(dt) 
    DT[, x := FALSE ]
    DT[ year %between% list(start,end), x := TRUE] 
    },
  akrun_nrow = {
    DT <- copy(dt)
    DT[, x := between(year, start, end), 1:nrow(DT)]
    },
  akrun_map = {
    DT <- copy(dt)
    DT[, x := unlist(do.call(Map, c(f = between, unname(.SD)))), .SDcols = year:end]
    },
  akrun_pmap = {
    DT <- copy(dt)
    DT[, x := purrr::pmap_lgl(.SD[, .(x = year, left = start, right = end)], between)]
    },
  markus = {
    DT <- copy(dt)
    DT[, col := mapply(between, year, start, end)]
  },
  times = 3
  )

results

Unit: milliseconds
       expr        min         lq       mean     median         uq        max neval
     wimpel   29.98388   30.41861   48.98399   30.85333   58.48404   86.11475     3
 akrun_nrow 2741.35268 2755.01860 2944.58975 2768.68453 3046.20829 3323.73206     3
  akrun_map 3673.21253 3683.22849 3711.51209 3693.24446 3730.66188 3768.07929     3
 akrun_pmap 3281.13335 3291.04689 3406.46131 3300.96043 3469.12528 3637.29013     3
     markus 3408.07869 3569.33044 3670.68141 3730.58219 3801.98277 3873.38334     3

There seems to be a clear winner.. but perhaps I'm missing something here?

like image 111
Wimpel Avatar answered Nov 15 '22 01:11

Wimpel