Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Row conditional column operations in data.table

Tags:

r

data.table

I have a large data.table where I for each row need to make computations based on part of the full data.table. As an example consider the following data.table, and assume I for each row want to compute the sum of the num variable for every rows where id2 matches id1 for the current row as well as the time variable is within distance 1 from the time of the current row.

set.seed(123)

dat <- data.table(cbind(id1=sample(1:5,10,replace=T),
                        id2=sample(1:5,10,replace=T),
                        num=sample(1:10,10,replace=T),
                        time=sample(1:10,10,replace=T)))

This could easily be done by looping over each row like this

dat[,val:= 0]
for (i in 1:nrow(dat)){
  this.val <- dat[ (id2==id1[i]) & (time>=time[i]-2) & (time<=time[i]+2),sum(num)]
  dat[i,val:=this.val]
}

dat

The resulting data.table looks like this:

   > dat
        id1 id2 num time val
     1:   2   5   9   10   6
     2:   4   3   7   10   0
     3:   3   4   7    7  10
     4:   5   3  10    8   9
     5:   5   1   7    1   2
     6:   1   5   8    5   6
     7:   3   2   6    8  17
     8:   5   1   6    3  10
     9:   3   2   3    4   0
    10:   3   5   2    3   0

What is the proper/fast way to do things like this using data.table?

like image 977
Mark Avatar asked Jan 04 '18 09:01

Mark


People also ask

Is data table DT == true?

data. table(DT) is TRUE. To better description, I put parts of my original code here. So you may understand where goes wrong.

What is data table in Excel?

A data table is a range of cells in which you can change values in some of the cells and come up with different answers to a problem. A good example of a data table employs the PMT function with different loan amounts and interest rates to calculate the affordable amount on a home mortgage loan.

What is a data table?

The data table is perhaps the most basic building block of business intelligence. In its simplest form, it consists of a series of columns and rows that intersect in cells, plus a header row in which the names of the columns are stated, to make the content of the table understandable to the end user.

Why is data table faster than dplyr?

table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas .


1 Answers

We can use a self-join here by creating the 'timeminus2' and 'timeplus2' column, join on by 'id2' with 'id1' and the non-equi logical condition to get the sum of 'num' and assign (:=) the 'val' column to the original dataset

tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2), 
             .(val = sum(num)),
             on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
             by = .EACHI
         ][is.na(val), val := 0][]
dat[, val := tmp$val][]
#     id1 id2 num time val
# 1:   2   5   9   10   6
# 2:   4   3   7   10   0
# 3:   3   4   7    7  10
# 4:   5   3  10    8   9
# 5:   5   1   7    1   2
# 6:   1   5   8    5   6
# 7:   3   2   6    8  17
# 8:   5   1   6    3  10
# 9:   3   2   3    4   0
#10:   3   5   2    3   0
like image 99
akrun Avatar answered Oct 16 '22 22:10

akrun