I am working with some information from treatments applied to animals in R
. First I would like to describe the structure of my information (I will add the dput()
version in the end). My data is DF
and it looks like this:
Treatment_ID Start_Date Valid
1 0031 2011-05-01 2011-05-30
2 0031 2011-05-01 2011-06-30
3 0045 2012-02-01 2012-03-01
4 0057 2012-04-01 2012-04-30
5 0057 2012-04-01 2012-05-30
6 0098 2012-10-01 2012-10-30
It has 56 rows and three variables Treatment_ID
(5 types of treatments), Start_Date
(Date of start of treatment) and Valid
(Final date of the treatment). For example the Treatment_ID
0031
has two observations because this started on May 2011 and finished on June 2011. Then a new treatment 0045
started on February 2012 and finished on March 2012 (Only one observation). The same structure applies for all the groups inside DF
. I need to compute the difference of months between each treatment and behind each treatment using some conditions. I will use the first two treatments to show this:
Treatment_ID Start_Date Valid
1 0031 2011-05-01 2011-05-30
2 0031 2011-05-01 2011-06-30
3 0045 2012-02-01 2012-03-01
4 0057 2012-04-01 2012-04-30
For this example, I have two rows for the first treatment where the Treatment_Id
variable is equal. When that occurs the difference in months for the variable Valid
must be calculated. When a new treatment appears then the difference in months between Start_Date
and Valid
must be calculated. Notice that when a treatment has more than one observation the difference is gotten by using Valid
variable for the observations in that group but when the Treatment_Id
changes therefore the difference must be obtained by using Start_Date
and Valid
variables. To get this variable Break_Months
I used the next structure:
DF$Break_Months=NA
for(i in c(2:(length(DF$Break_Months))))
{
DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),
round(as.numeric(DF$Start_Date[i]-DF$Valid[i-1])/30,0))
}
This for
when the Treatment_Id
are equal computes the difference between the actual row and the previous with Valid
variable and when they are different the difference is computed using Start_Date
and Valid
. The first value of Break_Months
is NA
because there is not a previous value to compare. The problem appeared in the end of DF
when I used the previous lines of code.
Treatment_ID Start_Date Valid Break_Months
47 0098 2012-10-01 2016-07-30 1
48 0098 2012-10-01 2016-08-31 1
49 0031 2016-09-01 2016-09-30 0
50 0031 2016-09-01 2016-10-30 1
51 0031 2016-09-01 2016-11-30 1
52 0031 2016-09-01 2016-12-30 1
53 0031 2016-09-01 2017-01-30 1
54 0031 2016-09-01 2017-03-02 1
55 0031 2016-09-01 2017-03-30 1
56 0012 2017-03-01 2017-03-30 -1
The Treatment_Id
0012
has only one observation because it is new and the Valid
date is the same that the last observation of treatment 0031
. Because the Treatment_Id
0031
has been used in other months then difference is computed with the previous observations inside the treatment. In the case of 0012
this is not possible because the Valid
date of this is the same that the last observation of 0031
and 0012
does not have more observations because it is new. When that happens the comparison must be done with the last observation of the group previous to 0031
this is 0098
. By using the concept due to 0012
is not equal to 0098
the Break_Months
is computed by the difference between 2017-03-01
(Start_Date
) and 2016-08-31
(Valid
) giving a value of 6
by the same mechanic in the for
structure and not -1
.
My question is related to how to incorporate this consideration inside the for
. It has been very complex to try to do this because I do not know how to integrate a comparison related to the date (if they are equal as in he example) and looking for the previous group before the one that contains the same date. I have tried using lag
function from dplyr
package to avoid the for
but the results are not the same. The dput()
version of DF
is the next:
DF<-structure(list(Treatment_ID = c("0031", "0031", "0045", "0057",
"0057", "0098", "0098", "0098", "0098", "0098", "0098", "0098",
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098",
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098",
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098",
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098",
"0098", "0098", "0098", "0098", "0031", "0031", "0031", "0031",
"0031", "0031", "0031", "0012"), Start_Date = structure(c(1304208000,
1304208000, 1328054400, 1333238400, 1333238400, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600,
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1472688000,
1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000,
1488326400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Valid = structure(c(1306713600, 1309392000, 1330560000, 1335744000,
1338336000, 1351555200, 1354233600, 1356825600, 1359504000,
1362182400, 1364601600, 1367280000, 1369872000, 1372550400,
1375142400, 1377820800, 1380499200, 1383091200, 1385769600,
1388361600, 1391040000, 1393718400, 1396137600, 1398816000,
1401408000, 1404086400, 1412035200, 1414627200, 1417305600,
1419897600, 1422576000, 1425254400, 1427673600, 1432944000,
1435622400, 1440892800, 1443571200, 1446163200, 1448841600,
1451433600, 1454112000, 1456790400, 1459296000, 1461974400,
1464566400, 1467244800, 1469836800, 1472601600, 1475193600,
1477785600, 1480464000, 1483056000, 1485734400, 1488412800,
1490832000, 1490832000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Treatment_ID",
"Start_Date", "Valid"), row.names = c(NA, -56L), class = "data.frame")
Thanks for your help.
Update
One of these solutions worked perfect. Now I have a little issue when I have to compute a similar variables. First I compute the variable Elapsed
that is the difference between Valid
and Start_Date
. I use the next code:
DF$Elapsed=round(as.numeric(DF$Valid-DF$Start_Date)/30,0)
Then, the dilemma appears. I have to compute the next two variables Last1
and Last2
. I use the next code for this:
#Compute Last1
DF$Last1=NA
DF$Last1[1]=0
for(j in c(2:length(DF$Last1)))
{
DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1],
ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[j-1],0))
}
The code worked parcially because I have a similar problem related to the variable Break_Months
. In this case because 0031
and 0012
have the same Valid
value the comparison of Treatment_Id
must not be done with the last value of 0031
where 7 is assigned due to the logic of the loop (variable Elapsed
). In this case the correct value is 48 because the comparison must be structured with the last observation of group 0098
then we get the 48. I have tried to modify with last_obs_index
but I can not get the correct result.
Treatment_ID Start_Date Valid Break_Months Elapsed Last1
47 0098 2012-10-01 2016-07-30 1 47 2
48 0098 2012-10-01 2016-08-31 1 48 2
49 0031 2016-09-01 2016-09-30 0 1 48
50 0031 2016-09-01 2016-10-30 1 2 48
51 0031 2016-09-01 2016-11-30 1 3 48
52 0031 2016-09-01 2016-12-30 1 4 48
53 0031 2016-09-01 2017-01-30 1 5 48
54 0031 2016-09-01 2017-03-02 1 6 48
55 0031 2016-09-01 2017-03-30 1 7 48
56 0012 2017-03-01 2017-03-30 6 1 7
For the variable Last2
I use the next code:
#Compute Last2
DF$Last2=NA
DF$Last2[1]=0
for(k in c(2:length(DF$Last2)))
{
DF$Last2[k]=ifelse(DF$Treatment_ID[k]==DF$Treatment_ID[k-1],DF$Last2[k-1],
ifelse(DF$Treatment_ID[k]!=DF$Treatment_ID[k-1],DF$Break_Months[k],0))
}
In this case it seems that worked but that is not true. Despite the fact that 6 is correct, the comparison is not well defined because 0012
and 0031
has the same Valid
date and the optimal comparison is using the last observation of 0098
group.Therefore, the value of Break_Months
is assigned. Again I could not fix the loop with the proper logic defined with last_obs_index
.
Treatment_ID Start_Date Valid Break_Months Elapsed Last1 Last2
47 0098 2012-10-01 2016-07-30 1 47 2 4
48 0098 2012-10-01 2016-08-31 1 48 2 4
49 0031 2016-09-01 2016-09-30 0 1 48 0
50 0031 2016-09-01 2016-10-30 1 2 48 0
51 0031 2016-09-01 2016-11-30 1 3 48 0
52 0031 2016-09-01 2016-12-30 1 4 48 0
53 0031 2016-09-01 2017-01-30 1 5 48 0
54 0031 2016-09-01 2017-03-02 1 6 48 0
55 0031 2016-09-01 2017-03-30 1 7 48 0
56 0012 2017-03-01 2017-03-30 6 1 7 6
Thanks for all the help this time, would it be possible to get advice about how to adapt the loops to get the comparisons in the right way.
Here's a method making use of a few extra dplyr functions like lag
and if_else
. It works out the comparison date (kept here for a sanity check), subtracts the previous Valid date from this, and then converts to rounded number of "months" (30 day periods).
library(dplyr)
mutate(DF,
comparison_date = if_else(Treatment_ID == lag(Treatment_ID), Valid, Start_Date),
Break_Months = difftime(comparison_date, lag(Valid), units = "days"),
Break_Months = as.numeric(round(Break_Months / 30)))
#> Treatment_ID Start_Date Valid comparison_date Break_Months
#> 1 0031 2011-05-01 2011-05-30 <NA> NA
#> 2 0031 2011-05-01 2011-06-30 2011-06-30 1
#> 3 0045 2012-02-01 2012-03-01 2012-02-01 7
#> 4 0057 2012-04-01 2012-04-30 2012-04-01 1
#> 5 0057 2012-04-01 2012-05-30 2012-05-30 1
#> 6 0098 2012-10-01 2012-10-30 2012-10-01 4
#> 7 0098 2012-10-01 2012-11-30 2012-11-30 1
#> 8 0098 2012-10-01 2012-12-30 2012-12-30 1
#> 9 0098 2012-10-01 2013-01-30 2013-01-30 1
#> 10 0098 2012-10-01 2013-03-02 2013-03-02 1
...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With