How to add the condition that there is a minimum amount of data to calculate a variable in `data.table`

Question

I have two dataframes. The first one df summarises detections of several individuals (ID) of a fish species overtime (DateTime). As an example:

options("digits.secs" = 3)

df<- data.frame(DateTime=c("2017-08-05 14:03:55.300","2017-08-05 16:18:12.100","2017-08-05 20:34:31.540","2017-08-05 16:18:14.355","2017-08-05 20:34:33.605"),
    ID= c("A","B","C","B","C"))

df
                 DateTime ID
1 2017-08-05 14:03:55.300  A
2 2017-08-05 16:18:12.100  B
3 2017-08-05 20:34:31.540  C
4 2017-08-05 16:18:14.355  B
5 2017-08-05 20:34:33.605  C

The other dataframe Activity has activity info about those individuals over time. This data is at high temporal resolution. That is, it is 11 data per second (11 hertz). As a reproducible example:

set.seed(100)
fmt <- "%Y-%m-%d %H:%M:%OS"

DateTime = seq(from=as.POSIXct("2017-08-05 14:03:55.100", format=fmt, tz="UTC"), by=1/11, length.out=80)
ID = rep("A", each=80)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
Activity1<- data.frame(DateTime,ID, x, y, z)

DateTime = seq(from=as.POSIXct("2017-08-05 16:18:11.900", format=fmt, tz="UTC"),by=1/5, length.out=40)
ID = rep("B", each=40)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
Activity2<- data.frame(DateTime,ID, x, y, z)

DateTime = seq(from=as.POSIXct("2017-08-05 16:18:19.703", format=fmt, tz="UTC"),by=1/11, length.out=40)
ID = rep("B", each=40)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 40, replace = TRUE)
Activity3<- data.frame(DateTime,ID, x, y, z)

DateTime = seq(from=as.POSIXct("2017-08-05 20:34:31.240", format=fmt, tz="UTC"),by=1/11, length.out=80)
ID = rep("C", each=80)
x= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
y= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
z= sample(seq(from = -1, to = 1, by = 0.01), size = 80, replace = TRUE)
Activity4<- data.frame(DateTime,ID, x, y, z)
Activity<- rbind(Activity1,Activity2,Activity3,Activity4)

head(Activity)
                 DateTime ID     x     y     z
1 2017-08-05 14:03:55.099  A  0.01 -0.16 -1.00
2 2017-08-05 14:03:55.190  A  0.11  0.55 -0.69
3 2017-08-05 14:03:55.281  A  0.50  0.79  1.00
4 2017-08-05 14:03:55.372  A  0.97 -0.76  0.24
5 2017-08-05 14:03:55.463  A -0.97 -0.59  0.20
6 2017-08-05 14:03:55.554  A -0.46  0.42 -0.88

I am using the code below to calculate the variables VeDBA and RMS in df. I use data from Activity to calculate them, and I add those variables to the dataframe df. As a summary, for each row of df data, I calculate VeDBA and RMS using 2 seconds of data (which is 22 rows since I have 11 data per second in the dataframe Activity) from the Activity data frame and using the DateTime of the data frame df as the start time.

library(data.table)
setDT(df)[, DateTime := as.POSIXct(DateTime, format=fmt, tz="UTC")][,
    c("start", "end") := .(DateTime, DateTime+2)]
setDT(Activity)[, DateTime := as.POSIXct(DateTime, format=fmt, tz="UTC")]

df<- Activity[df, on=.(ID, DateTime>=start, DateTime<=end),
    by=.EACHI, .(
        DateTime=i.DateTime,
        ID=i.ID, 
        VeDBA=sum(sqrt(x^2 + y^2 + z^2)) / .N,
        RMS=sqrt((sum(x^2) + sum(y^2) + sum(z^2)) / .N))][, 
            (1L:3L) := NULL][]

The problem is that sometimes I have 5 data per second instead of 11 in the dataframe Activity. So I want to include some kind of code that indicates than, for instance, when I have less than 14 data in the mentioned 2 second period, I want to appear NA for VeDBA and RMS in df.

With the mentioned code I get this so far:

df
                  DateTime ID     VeDBA       RMS
1: 2017-08-05 14:03:55.299  A 0.9919576 1.0264458
2: 2017-08-05 16:18:12.099  B 0.9375138 0.9573975
3: 2017-08-05 20:34:31.539  C 0.9294209 0.9764383
4: 2017-08-05 16:18:14.355  B 0.7542922 0.7886634
5: 2017-08-05 20:34:33.605  C 1.0041628 1.0395891

And I would like to get this:

df
                  DateTime ID     VeDBA       RMS
1: 2017-08-05 14:03:55.299  A 0.9919576 1.0264458
2: 2017-08-05 16:18:12.099  B NA        NA        # Between 16:18:12.099 and 16:18:14.099 there is only 10 data instead of 22
3: 2017-08-05 20:34:31.539  C 0.9294209 0.9764383
4: 2017-08-05 16:18:14.355  B NA        NA
5: 2017-08-05 20:34:33.605  C 1.0041628 1.0395891

Does anyone know how to modify the code I am using with data.table to get those NAs?

Uwe · Accepted Answer

The modification below returns NA if .N is below the given threshhold n_min:

n_min <- 14L
Activity[df, on = .(ID, DateTime >= start, DateTime <= end),
         by = .EACHI, .(
           DateTime = i.DateTime,
           ID = i.ID,
           .N,  # inserted just to verify the result, to be omitted in production code
           VeDBA = if (.N < n_min) NA_real_ else sum(sqrt(x ^ 2 + y ^ 2 + z ^ 2)) / .N,
           RMS = if (.N < n_min) NA_real_ else sqrt((sum(x ^ 2) + sum(y ^ 2) + sum(z ^ 2)) / .N)
         )][,
            (1L:3L) := NULL][]

                  DateTime ID  N     VeDBA       RMS
1: 2017-08-05 14:03:55.299  A 22 0.8777660 0.9181305
2: 2017-08-05 16:18:12.099  B 10        NA        NA
3: 2017-08-05 20:34:31.539  C 22 0.8807835 0.9383084
4: 2017-08-05 16:18:14.355  B 10        NA        NA
5: 2017-08-05 20:34:33.605  C 22 1.0587765 1.1023549

Please, note that column N has been inserted just to verify the result. Also note that the output differs from OP's expected result although set.seed(100) was used to create the data.

if() can be used here because there is only one .N value per .EACHI group.

How to add the condition that there is a minimum amount of data to calculate a variable in `data.table`

Tags:

r

data.table

Dekike

1 Answers

Uwe

Recent Activity

Donate For Us

How to add the condition that there is a minimum amount of data to calculate a variable in `data.table`

Tags:

r

data.table

Dekike

1 Answers

Uwe

Related questions

Recent Activity

Donate For Us