
R data table - calculation for each row using all rows before current row

Tags: r, data.table

I want to compute running counts per id, in row (time) order. For example, with:

dt = data.table(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  hour   = c(1, 5, 5, 6, 7, 8, 23, 23, 23),
  ip     = c(1, 1, 45, 2, 2, 2, 3, 1, 1),
  target = c(1, 0, 0, 1, 1, 1, 1, 1, 0),
  day    = c(1, 1, 1, 1, 1, 1, 3, 2, 1)
)

   id hour ip target day
1:  1    1  1      1   1
2:  1    5  1      0   1
3:  1    5 45      0   1
4:  2    6  2      1   1
5:  2    7  2      1   1
6:  2    8  2      1   1
7:  3   23  3      1   3
8:  3   23  1      1   2
9:  3   23  1      0   1

For each row, I want the number of active days and active hours seen so far for that id. That is, I want the following output:

   id hour ip target day  nb_active_hours_so_far
1:  1    1  1      1   1  0  (first occurrence of id when ordered by hour)
2:  1    5  1      0   1  1  (has been active in hour "1")
3:  1    5 45      0   1  2  (has been active in hour "1" and "5")
4:  2    6  2      1   1  0  (first occurrence)
5:  2    7  2      1   1  1  (has been active in hour "6")
6:  2    8  2      1   1  2  (has been active in hour "6" and "7")
7:  3   23  3      1   3  0  (first occurrence)
8:  3   23  1      1   2  1  (has been active in hour "23")
9:  3   23  1      0   1  1  (has been active in hour "23" only)

To get the total count of active hours I would do:

dt[, nb_active_hours := length(unique(hour)), by=id]

but I also need the running ("so far") version, and I do not know how to compute it. Any help would be appreciated.

Timothée HENRY asked Jun 29 '15 10:06


3 Answers

This seems to work (though I haven't tested it on different cases):

dt[, nb_active_hours_so_far := cumsum(c(0:1, diff(hour[-.N]))>0), by = id]
#    id hour ip target day nb_active_hours_so_far
# 1:  1    1  1      1   1                      0
# 2:  1    5  1      0   1                      1
# 3:  1    5 45      0   1                      2
# 4:  2    6  2      1   1                      0
# 5:  2    7  2      1   1                      1
# 6:  2    8  2      1   1                      2
# 7:  3   23  3      1   3                      0
# 8:  3   23  1      1   2                      1
# 9:  3   23  1      0   1                      1
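To see why the expression above produces the expected counts, it can help to evaluate it piece by piece on the hours of a single id. The following is a base R sketch, not part of the original answer: `n` stands in for data.table's `.N`, and the vectors are copied from ids 1 and 3 of the example table.

```r
# Hours for id 1 in the example table (already sorted)
hour <- c(1, 5, 5)
n <- length(hour)                 # plays the role of .N inside dt[...]

# Drop the last hour, then take successive differences
step <- diff(hour[-n])            # 4

# Prepend 0 (first row has no history) and 1 (second row has seen one hour)
flags <- c(0:1, step) > 0         # FALSE TRUE TRUE

so_far_id1 <- cumsum(flags)       # 0 1 2

# Same steps for id 3, where every row has hour 23
hour3 <- c(23, 23, 23)
so_far_id3 <- cumsum(c(0:1, diff(hour3[-length(hour3)])) > 0)  # 0 1 1

# Caveat: for a single-row group the expression still has length 2
single_len <- length(c(0:1, diff(c(6)[-1])))                   # 2
```

Because of the prepended `0:1`, the result always has at least two elements, so an id with only one row would likely need separate handling. The approach also assumes that hours are sorted in increasing order within each id.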
David Arenburg answered Oct 13 '22 19:10


Yerk. I have this ugly solution:

library(data.table)
dt[, nb_active_hours_so_far := c(0, head(cumsum(c(1, diff(hour) > 0)), -1)), by = id][]

#   id hour ip target day nb_active_hours_so_far
#1:  1    1  1      1   1                      0
#2:  1    5  1      0   1                      1
#3:  1    5 45      0   1                      2
#4:  2    6  2      1   1                      0
#5:  2    7  2      1   1                      1
#6:  2    8  2      1   1                      2
#7:  3   23  3      1   3                      0
#8:  3   23  1      1   2                      1
#9:  3   23  1      0   1                      1
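Both `diff`-based answers rely on the column being sorted in increasing order within each id, which does not hold for the `day` column of id 3 (3, 2, 1). As one possible alternative that is not taken from any of the answers, the sketch below counts the distinct values seen in earlier rows with base R's `duplicated()`. The helper name `distinct_so_far` is hypothetical.

```r
# Hypothetical helper: number of distinct values seen strictly before each row
distinct_so_far <- function(x) {
  seen <- cumsum(!duplicated(x))   # distinct values up to and including each row
  c(0L, head(seen, -1L))           # shift by one so the current row is excluded
}

distinct_so_far(c(1, 5, 5))      # hours of id 1: 0 1 2
distinct_so_far(c(23, 23, 23))   # hours of id 3: 0 1 1
distinct_so_far(c(3, 2, 1))      # days of id 3:  0 1 2
distinct_so_far(6)               # single-row group: 0
```

Applied per group, this could look like `dt[, nb_active_days_so_far := distinct_so_far(day), by = id]`. The rows are taken in their existing order, so the table should be ordered by time first if it is not already.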
Colonel Beauvel answered Oct 13 '22 19:10


Or you could use the functions rleid() and shift() from the devel version of data.table, i.e. v1.9.5 (both later shipped in the CRAN release v1.9.6). Instructions for installing the devel version are on the data.table project page. (Thanks to @Frank for the shift suggestion.)

 library(data.table)
 dt[, nb_active_hours_so_far := shift(rleid(hour), fill = 0L), by = id]
 #   id hour ip target day nb_active_hours_so_far
 #1:  1    1  1      1   1                      0
 #2:  1    5  1      0   1                      1
 #3:  1    5 45      0   1                      2
 #4:  2    6  2      1   1                      0
 #5:  2    7  2      1   1                      1
 #6:  2    8  2      1   1                      2
 #7:  3   23  3      1   3                      0
 #8:  3   23  1      1   2                      1
 #9:  3   23  1      0   1                      1
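For readers unfamiliar with the two functions, here is a small sketch of what each step returns on plain vectors. It assumes a data.table version that exports `rleid()` and `shift()`. The input vectors are taken from the example table, except the last one, which is made up to illustrate a caveat.

```r
library(data.table)

# rleid() numbers consecutive runs of equal values
runs_id1 <- rleid(c(1, 5, 5))        # 1 2 2
runs_id3 <- rleid(c(23, 23, 23))     # 1 1 1

# shift() lags by one position by default; fill = 0L supplies the first value
lag_id1 <- shift(runs_id1, fill = 0L)   # 0 1 2
lag_id3 <- shift(runs_id3, fill = 0L)   # 0 1 1

# Caveat (made-up input): a value that reappears after a gap starts a new run
runs_gap <- rleid(c(1, 5, 1))        # 1 2 3
```

Because runs are counted and not distinct values, the result matches the requested count only when equal hours are adjacent within each id, as they are in the example.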
akrun answered Oct 13 '22 21:10