I have two dataframes. The first one df summarises detections of several individuals (ID) of a fish species overtime (DateTime). As an example:
options("digits.secs" = 3)
df<- data.frame(DateTime=c("2017-08-05 14:03:55.300","2017-08-05 16:18:12.100","2017-08-05 20:34:31.540","2017-08-05 16:18:14.355","2017-08-05 20:34:33.605"),
ID= c("A","B","C","B","C"))
df
DateTime ID
1 2017-08-05 14:03:55.300 A
2 2017-08-05 16:18:12.100 B
3 2017-08-05 20:34:31.540 C
4 2017-08-05 16:18:14.355 B
5 2017-08-05 20:34:33.605 C
The other dataframe Activity has activity info about those individuals over time. This data is at high temporal resolution. That is, it is 11 data per second (11 hertz). As a reproducible example:
set.seed(100)
fmt <- "%Y-%m-%d %H:%M:%OS"
DateTime = seq(from=as.POSIXct("2017-08-05 14:03:55.100", format=fmt, tz="UTC"), by=1/11, length.out=80)
ID = rep("A", each=80)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
Activity1<- data.frame(DateTime,ID, x, y, z)
DateTime = seq(from=as.POSIXct("2017-08-05 16:18:11.900", format=fmt, tz="UTC"),by=1/5, length.out=40)
ID = rep("B", each=40)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
Activity2<- data.frame(DateTime,ID, x, y, z)
DateTime = seq(from=as.POSIXct("2017-08-05 16:18:19.703", format=fmt, tz="UTC"),by=1/11, length.out=40)
ID = rep("B", each=40)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
Activity3<- data.frame(DateTime,ID, x, y, z)
DateTime = seq(from=as.POSIXct("2017-08-05 20:34:31.240", format=fmt, tz="UTC"),by=1/11, length.out=80)
ID = rep("C", each=80)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
Activity4<- data.frame(DateTime,ID, x, y, z)
Activity<- rbind(Activity1,Activity2,Activity3,Activity4)
head(Activity)
DateTime ID x y z
1 2017-08-05 14:03:55.099 A 0.01 -0.16 -1.00
2 2017-08-05 14:03:55.190 A 0.11 0.55 -0.69
3 2017-08-05 14:03:55.281 A 0.50 0.79 1.00
4 2017-08-05 14:03:55.372 A 0.97 -0.76 0.24
5 2017-08-05 14:03:55.463 A -0.97 -0.59 0.20
6 2017-08-05 14:03:55.554 A -0.46 0.42 -0.88
I am using the code below to calculate the variables VeDBA and RMS in df. I use data from Activity to calculate them, and I add those variables to the dataframe df. As a summary, for each row of df data, I calculate VeDBA and RMS using 2 seconds of data (which is 22 rows since I have 11 data per second in the dataframe Activity) from the Activity data frame and using the DateTime of the data frame df as the start time.
library(data.table)
setDT(df)[, DateTime := as.POSIXct(DateTime, format=fmt, tz="UTC")][,
c("start", "end") := .(DateTime, DateTime+2)]
setDT(Activity)[, DateTime := as.POSIXct(DateTime, format=fmt, tz="UTC")]
df<- Activity[df, on=.(ID, DateTime>=start, DateTime<=end),
by=.EACHI, .(
DateTime=i.DateTime,
ID=i.ID,
VeDBA=sum(sqrt(x^2 + y^2 + z^2)) / .N,
RMS=sqrt((sum(x^2) + sum(y^2) + sum(z^2)) / .N))][,
(1L:3L) := NULL][]
The problem is that sometimes I have 5 data per second instead of 11 in the dataframe Activity. So I want to include some kind of code that indicates than, for instance, when I have less than 14 data in the mentioned 2 second period, I want to appear NA for VeDBA and RMS in df.
With the mentioned code I get this so far:
df
DateTime ID VeDBA RMS
1: 2017-08-05 14:03:55.299 A 0.9919576 1.0264458
2: 2017-08-05 16:18:12.099 B 0.9375138 0.9573975
3: 2017-08-05 20:34:31.539 C 0.9294209 0.9764383
4: 2017-08-05 16:18:14.355 B 0.7542922 0.7886634
5: 2017-08-05 20:34:33.605 C 1.0041628 1.0395891
And I would like to get this:
df
DateTime ID VeDBA RMS
1: 2017-08-05 14:03:55.299 A 0.9919576 1.0264458
2: 2017-08-05 16:18:12.099 B NA NA # Between 16:18:12.099 and 16:18:14.099 there is only 10 data instead of 22
3: 2017-08-05 20:34:31.539 C 0.9294209 0.9764383
4: 2017-08-05 16:18:14.355 B NA NA
5: 2017-08-05 20:34:33.605 C 1.0041628 1.0395891
Does anyone know how to modify the code I am using with data.table to get those NAs?
The modification below returns NA if .N is below the given threshhold n_min:
n_min <- 14L
Activity[df, on = .(ID, DateTime >= start, DateTime <= end),
by = .EACHI, .(
DateTime = i.DateTime,
ID = i.ID,
.N, # inserted just to verify the result, to be omitted in production code
VeDBA = if (.N < n_min) NA_real_ else sum(sqrt(x ^ 2 + y ^ 2 + z ^ 2)) / .N,
RMS = if (.N < n_min) NA_real_ else sqrt((sum(x ^ 2) + sum(y ^ 2) + sum(z ^ 2)) / .N)
)][,
(1L:3L) := NULL][]
DateTime ID N VeDBA RMS 1: 2017-08-05 14:03:55.299 A 22 0.8777660 0.9181305 2: 2017-08-05 16:18:12.099 B 10 NA NA 3: 2017-08-05 20:34:31.539 C 22 0.8807835 0.9383084 4: 2017-08-05 16:18:14.355 B 10 NA NA 5: 2017-08-05 20:34:33.605 C 22 1.0587765 1.1023549
Please, note that column N has been inserted just to verify the result. Also note that the output differs from OP's expected result although set.seed(100) was used to create the data.
if() can be used here because there is only one .N value per .EACHI group.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With