Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compute date difference between rows considering different groups inside dataframe

Tags:

r

dplyr

I am working with some information from treatments applied to animals in R. First I would like to describe the structure of my information (I will add the dput() version in the end). My data is DF and it looks like this:

  Treatment_ID Start_Date      Valid
1         0031 2011-05-01 2011-05-30
2         0031 2011-05-01 2011-06-30
3         0045 2012-02-01 2012-03-01
4         0057 2012-04-01 2012-04-30
5         0057 2012-04-01 2012-05-30
6         0098 2012-10-01 2012-10-30

It has 56 rows and three variables Treatment_ID (5 types of treatments), Start_Date (Date of start of treatment) and Valid (Final date of the treatment). For example the Treatment_ID 0031 has two observations because this started on May 2011 and finished on June 2011. Then a new treatment 0045started on February 2012 and finished on March 2012 (Only one observation). The same structure applies for all the groups inside DF. I need to compute the difference of months between each treatment and behind each treatment using some conditions. I will use the first two treatments to show this:

  Treatment_ID Start_Date      Valid
1         0031 2011-05-01 2011-05-30
2         0031 2011-05-01 2011-06-30
3         0045 2012-02-01 2012-03-01
4         0057 2012-04-01 2012-04-30  

For this example, I have two rows for the first treatment where the Treatment_Id variable is equal. When that occurs the difference in months for the variable Valid must be calculated. When a new treatment appears then the difference in months between Start_Date and Valid must be calculated. Notice that when a treatment has more than one observation the difference is gotten by using Valid variable for the observations in that group but when the Treatment_Id changes therefore the difference must be obtained by using Start_Date and Valid variables. To get this variable Break_Months I used the next structure:

DF$Break_Months=NA

for(i in c(2:(length(DF$Break_Months))))
{
  DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),
                            round(as.numeric(DF$Start_Date[i]-DF$Valid[i-1])/30,0))
}

This for when the Treatment_Id are equal computes the difference between the actual row and the previous with Valid variable and when they are different the difference is computed using Start_Date and Valid. The first value of Break_Months is NA because there is not a previous value to compare. The problem appeared in the end of DF when I used the previous lines of code.

   Treatment_ID Start_Date      Valid Break_Months
47         0098 2012-10-01 2016-07-30            1
48         0098 2012-10-01 2016-08-31            1
49         0031 2016-09-01 2016-09-30            0
50         0031 2016-09-01 2016-10-30            1
51         0031 2016-09-01 2016-11-30            1
52         0031 2016-09-01 2016-12-30            1
53         0031 2016-09-01 2017-01-30            1
54         0031 2016-09-01 2017-03-02            1
55         0031 2016-09-01 2017-03-30            1
56         0012 2017-03-01 2017-03-30           -1

The Treatment_Id 0012 has only one observation because it is new and the Valid date is the same that the last observation of treatment 0031. Because the Treatment_Id 0031 has been used in other months then difference is computed with the previous observations inside the treatment. In the case of 0012 this is not possible because the Valid date of this is the same that the last observation of 0031 and 0012 does not have more observations because it is new. When that happens the comparison must be done with the last observation of the group previous to 0031 this is 0098. By using the concept due to 0012 is not equal to 0098 the Break_Months is computed by the difference between 2017-03-01 (Start_Date) and 2016-08-31 (Valid) giving a value of 6 by the same mechanic in the for structure and not -1.

My question is related to how to incorporate this consideration inside the for. It has been very complex to try to do this because I do not know how to integrate a comparison related to the date (if they are equal as in he example) and looking for the previous group before the one that contains the same date. I have tried using lag function from dplyr package to avoid the for but the results are not the same. The dput() version of DF is the next:

DF<-structure(list(Treatment_ID = c("0031", "0031", "0045", "0057", 
"0057", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0031", "0031", "0031", "0031", 
"0031", "0031", "0031", "0012"), Start_Date = structure(c(1304208000, 
1304208000, 1328054400, 1333238400, 1333238400, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1472688000, 
1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 
1488326400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Valid = structure(c(1306713600, 1309392000, 1330560000, 1335744000, 
    1338336000, 1351555200, 1354233600, 1356825600, 1359504000, 
    1362182400, 1364601600, 1367280000, 1369872000, 1372550400, 
    1375142400, 1377820800, 1380499200, 1383091200, 1385769600, 
    1388361600, 1391040000, 1393718400, 1396137600, 1398816000, 
    1401408000, 1404086400, 1412035200, 1414627200, 1417305600, 
    1419897600, 1422576000, 1425254400, 1427673600, 1432944000, 
    1435622400, 1440892800, 1443571200, 1446163200, 1448841600, 
    1451433600, 1454112000, 1456790400, 1459296000, 1461974400, 
    1464566400, 1467244800, 1469836800, 1472601600, 1475193600, 
    1477785600, 1480464000, 1483056000, 1485734400, 1488412800, 
    1490832000, 1490832000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Treatment_ID", 
"Start_Date", "Valid"), row.names = c(NA, -56L), class = "data.frame")

Thanks for your help.

Update One of these solutions worked perfect. Now I have a little issue when I have to compute a similar variables. First I compute the variable Elapsed that is the difference between Valid and Start_Date. I use the next code:

DF$Elapsed=round(as.numeric(DF$Valid-DF$Start_Date)/30,0)

Then, the dilemma appears. I have to compute the next two variables Last1 and Last2. I use the next code for this:

#Compute Last1
DF$Last1=NA
DF$Last1[1]=0
for(j in c(2:length(DF$Last1)))
{
  DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1],
                     ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[j-1],0))
}

The code worked parcially because I have a similar problem related to the variable Break_Months. In this case because 0031 and 0012 have the same Valid value the comparison of Treatment_Id must not be done with the last value of 0031 where 7 is assigned due to the logic of the loop (variable Elapsed). In this case the correct value is 48 because the comparison must be structured with the last observation of group 0098 then we get the 48. I have tried to modify with last_obs_index but I can not get the correct result.

   Treatment_ID Start_Date      Valid Break_Months Elapsed Last1
47         0098 2012-10-01 2016-07-30            1      47     2
48         0098 2012-10-01 2016-08-31            1      48     2
49         0031 2016-09-01 2016-09-30            0       1    48
50         0031 2016-09-01 2016-10-30            1       2    48
51         0031 2016-09-01 2016-11-30            1       3    48
52         0031 2016-09-01 2016-12-30            1       4    48
53         0031 2016-09-01 2017-01-30            1       5    48
54         0031 2016-09-01 2017-03-02            1       6    48
55         0031 2016-09-01 2017-03-30            1       7    48
56         0012 2017-03-01 2017-03-30            6       1     7

For the variable Last2 I use the next code:

#Compute Last2
DF$Last2=NA
DF$Last2[1]=0
for(k in c(2:length(DF$Last2)))
{
  DF$Last2[k]=ifelse(DF$Treatment_ID[k]==DF$Treatment_ID[k-1],DF$Last2[k-1],
                     ifelse(DF$Treatment_ID[k]!=DF$Treatment_ID[k-1],DF$Break_Months[k],0))
}

In this case it seems that worked but that is not true. Despite the fact that 6 is correct, the comparison is not well defined because 0012 and 0031 has the same Valid date and the optimal comparison is using the last observation of 0098 group.Therefore, the value of Break_Months is assigned. Again I could not fix the loop with the proper logic defined with last_obs_index.

   Treatment_ID Start_Date      Valid Break_Months Elapsed Last1 Last2
47         0098 2012-10-01 2016-07-30            1      47     2     4
48         0098 2012-10-01 2016-08-31            1      48     2     4
49         0031 2016-09-01 2016-09-30            0       1    48     0
50         0031 2016-09-01 2016-10-30            1       2    48     0
51         0031 2016-09-01 2016-11-30            1       3    48     0
52         0031 2016-09-01 2016-12-30            1       4    48     0
53         0031 2016-09-01 2017-01-30            1       5    48     0
54         0031 2016-09-01 2017-03-02            1       6    48     0
55         0031 2016-09-01 2017-03-30            1       7    48     0
56         0012 2017-03-01 2017-03-30            6       1     7     6

Thanks for all the help this time, would it be possible to get advice about how to adapt the loops to get the comparisons in the right way.

like image 769
Duck Avatar asked Oct 18 '22 11:10

Duck


1 Answers

Here's a method making use of a few extra dplyr functions like lag and if_else. It works out the comparison date (kept here for a sanity check), subtracts the previous Valid date from this, and then converts to rounded number of "months" (30 day periods).

library(dplyr)

  mutate(DF,
    comparison_date = if_else(Treatment_ID == lag(Treatment_ID), Valid, Start_Date),
    Break_Months = difftime(comparison_date, lag(Valid), units = "days"),
    Break_Months = as.numeric(round(Break_Months / 30)))

#>    Treatment_ID Start_Date      Valid comparison_date Break_Months
#> 1          0031 2011-05-01 2011-05-30            <NA>           NA
#> 2          0031 2011-05-01 2011-06-30      2011-06-30            1
#> 3          0045 2012-02-01 2012-03-01      2012-02-01            7
#> 4          0057 2012-04-01 2012-04-30      2012-04-01            1
#> 5          0057 2012-04-01 2012-05-30      2012-05-30            1
#> 6          0098 2012-10-01 2012-10-30      2012-10-01            4
#> 7          0098 2012-10-01 2012-11-30      2012-11-30            1
#> 8          0098 2012-10-01 2012-12-30      2012-12-30            1
#> 9          0098 2012-10-01 2013-01-30      2013-01-30            1
#> 10         0098 2012-10-01 2013-03-02      2013-03-02            1
...
like image 105
Simon Jackson Avatar answered Oct 21 '22 07:10

Simon Jackson