I want to compute running counts per id, following row (time) order. For example, with:
library(data.table)
dt = data.table(id=c(1,1,1,2,2,2,3,3,3), hour=c(1,5,5,6,7,8,23,23,23), ip=c(1,1,45,2,2,2,3,1,1), target=c(1,0,0,1,1,1,1,1,0), day=c(1,1,1,1,1,1,3,2,1))
   id hour ip target day
1:  1    1  1      1   1
2:  1    5  1      0   1
3:  1    5 45      0   1
4:  2    6  2      1   1
5:  2    7  2      1   1
6:  2    8  2      1   1
7:  3   23  3      1   3
8:  3   23  1      1   2
9:  3   23  1      0   1
For each row, I want to count the number of distinct active days and active hours that the id has had in earlier rows. For hours, this means the following output:
   id hour ip target day nb_active_hours_so_far
1:  1    1  1      1   1                      0  (first occurrence of id when ordered by hour)
2:  1    5  1      0   1                      1  (has been active in hour "1")
3:  1    5 45      0   1                      2  (has been active in hour "1" and "5")
4:  2    6  2      1   1                      0  (first occurrence)
5:  2    7  2      1   1                      1  (has been active in hour "6")
6:  2    8  2      1   1                      2  (has been active in hour "6" and "7")
7:  3   23  3      1   3                      0  (first occurrence)
8:  3   23  1      1   2                      1  (has been active in hour "23")
9:  3   23  1      0   1                      1  (has been active in hour "23" only)
To get the total count of active hours I would do:
dt[, nb_active_hours := length(unique(hour)), by=id]
but I also want the "so far" version, and I do not know how to do that. Any help would be appreciated.
This seems to work (though I haven't tested it on other cases):
dt[, nb_active_hours_so_far := cumsum(c(0:1, diff(hour[-.N]))>0), by = id]
#    id hour ip target day nb_active_hours_so_far
# 1:  1    1  1      1   1                      0
# 2:  1    5  1      0   1                      1
# 3:  1    5 45      0   1                      2
# 4:  2    6  2      1   1                      0
# 5:  2    7  2      1   1                      1
# 6:  2    8  2      1   1                      2
# 7:  3   23  3      1   3                      0
# 8:  3   23  1      1   2                      1
# 9:  3   23  1      0   1                      1
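To see why the expression above gives the expected counts, it can be unpacked for a single id in base R. This is only a sketch: the function name count_so_far is made up for illustration, .N is replaced by length(hour), and the hours are assumed to be sorted in ascending order within each id.

```r
# Hours for id 1 and id 3 from the example, already sorted within each id
hours_id1 <- c(1, 5, 5)
hours_id3 <- c(23, 23, 23)

# Hypothetical helper: the same expression as in the data.table call,
# with .N replaced by length(hour)
count_so_far <- function(hour) {
  prev  <- hour[-length(hour)]   # every row except the last one
  steps <- c(0:1, diff(prev))    # 0 for row 1, 1 for row 2, then hour-to-hour changes
  cumsum(steps > 0)              # running count of positive steps
}

count_so_far(hours_id1)          # 0 1 2
count_so_far(hours_id3)          # 0 1 1
length(count_so_far(7))          # 2, although the group has only one row
```

The last call suggests a limitation: for an id that occurs only once, the expression returns two values, which would not fit a one-row group, so such ids probably need separate handling.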
Yerk. I have this ugly solution:
library(data.table)
dt[, nb_active_hours_so_far := c(0, head(cumsum(c(1, diff(hour) > 0)), -1)), by = id][]
#   id hour ip target day nb_active_hours_so_far
#1:  1    1  1      1   1                      0
#2:  1    5  1      0   1                      1
#3:  1    5 45      0   1                      2
#4:  2    6  2      1   1                      0
#5:  2    7  2      1   1                      1
#6:  2    8  2      1   1                      2
#7:  3   23  3      1   3                      0
#8:  3   23  1      1   2                      1
#9:  3   23  1      0   1                      1
Or you could make use of the functions rleid/shift from the devel version of data.table, i.e. v1.9.5. Instructions to install the devel version are here. (Thanks to @Frank for the shift.)
library(data.table)
dt[, nb_active_hours_so_far := shift(rleid(hour), fill = 0L), by = id]
#   id hour ip target day nb_active_hours_so_far
#1:  1    1  1      1   1                      0
#2:  1    5  1      0   1                      1
#3:  1    5 45      0   1                      2
#4:  2    6  2      1   1                      0
#5:  2    7  2      1   1                      1
#6:  2    8  2      1   1                      2
#7:  3   23  3      1   3                      0
#8:  3   23  1      1   2                      1
#9:  3   23  1      0   1                      1
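The question also asks for active days, and the day column is not sorted within id 3. Approaches based on diff or rleid count runs of equal values, so they rely on equal values sitting next to each other. A possible variant that counts distinct values regardless of order is sketched below. It assumes "so far" means the distinct values seen in earlier rows of the same id, in the table's current row order; the helper name n_distinct_before and the column nb_active_days_so_far are made up for illustration.

```r
library(data.table)

dt <- data.table(id     = c(1,1,1,2,2,2,3,3,3),
                 hour   = c(1,5,5,6,7,8,23,23,23),
                 ip     = c(1,1,45,2,2,2,3,1,1),
                 target = c(1,0,0,1,1,1,1,1,0),
                 day    = c(1,1,1,1,1,1,3,2,1))

# Hypothetical helper: number of distinct values in the rows before each position
n_distinct_before <- function(x) {
  seen <- cumsum(!duplicated(x))   # distinct values up to and including each row
  c(0L, head(seen, -1))            # shift by one row so the current row is excluded
}

dt[, `:=`(nb_active_hours_so_far = n_distinct_before(hour),
          nb_active_days_so_far  = n_distinct_before(day)), by = id]

dt$nb_active_hours_so_far   # 0 1 2 0 1 2 0 1 1
dt$nb_active_days_so_far    # 0 1 1 0 1 1 0 1 2
```

Because duplicated() looks at all earlier values rather than only the previous one, this version should also cope with ids that occur only once and with values that reappear after a gap.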