Row conditional column operations in data.table

Tags: r, data.table

I have a large data.table in which, for each row, I need to compute a value from a subset of the full table. As an example, consider the following data.table, and suppose that for each row I want the sum of the num variable over every row whose id2 matches the current row's id1 and whose time is within distance 2 of the current row's time.

library(data.table)
set.seed(123)

dat <- data.table(cbind(id1=sample(1:5,10,replace=TRUE),
                        id2=sample(1:5,10,replace=TRUE),
                        num=sample(1:10,10,replace=TRUE),
                        time=sample(1:10,10,replace=TRUE)))

This can easily be done by looping over the rows, like this:

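dat[, val := 0]
for (i in seq_len(nrow(dat))) {
  # sum num over the rows whose id2 equals this row's id1 and whose time is within 2
  this.val <- dat[(id2 == id1[i]) & (time >= time[i] - 2) & (time <= time[i] + 2), sum(num)]
  dat[i, val := this.val]
}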

dat

The resulting data.table looks like this:

   > dat
        id1 id2 num time val
     1:   2   5   9   10   6
     2:   4   3   7   10   0
     3:   3   4   7    7  10
     4:   5   3  10    8   9
     5:   5   1   7    1   2
     6:   1   5   8    5   6
     7:   3   2   6    8  17
     8:   5   1   6    3  10
     9:   3   2   3    4   0
    10:   3   5   2    3   0
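A note for anyone re-running this: the default algorithm behind sample() changed in R 3.6.0, so set.seed(123) may not reproduce the table above on a newer R. The following is a minimal sketch that types the same ten rows in by hand and recomputes val with a brute-force base R check. The helper name brute_val and its width argument are made up for illustration; they are not part of data.table.

```r
library(data.table)

# The ten rows exactly as printed above, entered by hand so the example
# does not depend on the random number generator.
dat <- data.table(
  id1  = c(2L, 4L, 3L, 5L, 5L, 1L, 3L, 5L, 3L, 3L),
  id2  = c(5L, 3L, 4L, 3L, 1L, 5L, 2L, 1L, 2L, 5L),
  num  = c(9L, 7L, 7L, 10L, 7L, 8L, 6L, 6L, 3L, 2L),
  time = c(10L, 10L, 7L, 8L, 1L, 5L, 8L, 3L, 4L, 3L)
)

# brute_val() is a hypothetical helper: for row i it sums num over all rows
# whose id2 equals id1[i] and whose time lies within `width` of time[i].
brute_val <- function(d, width = 2L) {
  vapply(seq_len(nrow(d)), function(i) {
    keep <- d$id2 == d$id1[i] & abs(d$time - d$time[i]) <= width
    sum(d$num[keep])  # sum over an empty selection is 0
  }, integer(1))
}

expected <- brute_val(dat)
expected  # values: 6 0 10 9 2 6 17 10 0 0
```

This does the same O(n^2) work as the loop, so it is only meant as a reference result to compare faster approaches against.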

What is the proper/fast way to do things like this using data.table?


asked Jan 04 '18 09:01

Mark

1 Answer

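We can use a non-equi self-join here. Build a lookup table containing each row's 'id1' together with 'timeminus2' (time - 2) and 'timeplus2' (time + 2), join it to the data on 'id2' = 'id1' plus the two non-equi conditions on 'time', and take the sum of 'num' for each lookup row (by = .EACHI). Rows without a match give NA, which is replaced by 0. Finally, assign (:=) the result as the 'val' column of the original dataset: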

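# i: one lookup row per original row (its id1 and its time window)
# by = .EACHI evaluates sum(num) once per lookup row, in lookup order
tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2), 
             .(val = sum(num)),
             on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
             by = .EACHI
         ][is.na(val), val := 0][]  # no match gives NA; set those to 0
dat[, val := tmp$val][]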
#     id1 id2 num time val
# 1:   2   5   9   10   6
# 2:   4   3   7   10   0
# 3:   3   4   7    7  10
# 4:   5   3  10    8   9
# 5:   5   1   7    1   2
# 6:   1   5   8    5   6
# 7:   3   2   6    8  17
# 8:   5   1   6    3  10
# 9:   3   2   3    4   0
#10:   3   5   2    3   0
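If the window width needs to vary, the same join can be wrapped in a small function. This is only a sketch under a few assumptions: window_sum, width and the lookup column names id, lo and hi are hypothetical names, the data are typed in from the printed table rather than drawn with sample(), and sum(num, na.rm = TRUE) is used so that rows without a match come back as 0 directly (this would also silently drop genuine NAs in num, which may or may not be what you want).

```r
library(data.table)

# Same ten rows as the printed table, entered by hand.
dat <- data.table(
  id1  = c(2L, 4L, 3L, 5L, 5L, 1L, 3L, 5L, 3L, 3L),
  id2  = c(5L, 3L, 4L, 3L, 1L, 5L, 2L, 1L, 2L, 5L),
  num  = c(9L, 7L, 7L, 10L, 7L, 8L, 6L, 6L, 3L, 2L),
  time = c(10L, 10L, 7L, 8L, 1L, 5L, 8L, 3L, 4L, 3L)
)

# window_sum() is a hypothetical wrapper around the non-equi self-join.
window_sum <- function(d, width = 2L) {
  # One lookup row per original row: the id to match and the time window.
  lookup <- d[, .(id = id1, lo = time - width, hi = time + width)]
  # by = .EACHI yields one result row per lookup row, in lookup order,
  # so the sums line up with the rows of d.
  res <- d[lookup,
           .(val = sum(num, na.rm = TRUE)),
           on = .(id2 = id, time >= lo, time <= hi),
           by = .EACHI]
  res$val
}

vals <- window_sum(dat)
dat[, val := vals]
dat$val  # values: 6 0 10 9 2 6 17 10 0 0
```

Computing the vector first and assigning it in a separate step keeps the join from running inside a `:=` call on the same table.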

answered Oct 16 '22 22:10

akrun

