Thanks in advance.
I am trying to add missing date values that were not included in a observation period for three different individuals.
My data look like this:
IndID Date Event Number Percent
1 P01 2011-03-04 1 2 0.390
2 P01 2011-03-11 1 2 0.975
3 P01 2011-03-13 0 9 0.795
4 P01 2011-03-14 0 10 0.516
5 P01 2011-03-15 0 1 0.117
6 P01 2011-03-17 0 7 0.093
IndID
is the individual ID (P01
, P03
, P06
). Date
is obviously the Date. Event
is a binary variable indicating whether an event occurred (0
= no and 1
= yes).
Columns Number
and Percent
are not directly relevant, but need to be preserved and are thus included here.
My sample data frame (PostData
) is included below using dput
.
For each IndID
the first and last Date
are the beginning and end of an observation period respectively, within which there are missing dates. Here, my goal is to add the missing dates for each individual and add a 0
in the Event
column. The other columns (Number
and Percent
) can remain blank.
This post has been useful, but lacks info on my main problem - multiple individuals.
The observation period for each individual is from min(PostData$Date)
to max(PostData$Date)
. I have been attempting to create a complete Date sequence for each individual and then merge
it with the existing data frame within a for
loop. There is surely a better idea.
Any suggestions are appreciated.
PostData <-structure(list(IndID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L), .Label = c("P01", "P02", "P03", "P05", "P06", "P07",
"P08", "P09", "P10", "P11", "P12", "P13"), class = "factor"),
Date = structure(c(1299196800, 1299801600, 1299974400, 1300060800,
1300147200, 1300320000, 1300406400, 1310083200, 1310169600,
1310515200, 1310774400, 1310947200, 1311033600, 1311292800,
1311552000, 1323129600, 1323388800, 1323648000, 1323993600,
1324080000, 1324166400, 1324339200, 1327622400, 1327795200,
1327881600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Event = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L), Number = c(2L,
2L, 9L, 10L, 1L, 7L, 5L, 9L, 1L, 4L, 5L, 2L, 0L, 1L, 10L,
5L, 0L, 6L, 5L, 10L, 9L, 4L, 4L, 8L, 1L), Percent = c(0.39,
0.975, 0.795, 0.516, 0.117, 0.093, 0.528, 0.659, 0.308, 0.055,
0.185, 0.761, 0.132, 0.676, 0.368, 0.383, 0.272, 0.113, 0.974,
0.696, 0.941, 0.751, 0.758, 0.29, 0.15)), .Names = c("IndID",
"Date", "Event", "Number", "Percent"), row.names = c(NA, 25L),
class = "data.frame")
A base R version:
do.call(rbind,
by(
PostData,
PostData$IndID,
function(x) {
out <- merge(
data.frame(
IndID=x$IndID[1],
Date=seq.POSIXt(min(x$Date),max(x$Date),by="1 day")
),
x,
all.x=TRUE
)
out$Event[is.na(out$Event)] <- 0
out
}
)
)
Result:
IndID Date Event Number Percent
P01.1 P01 2011-03-04 1 2 0.390
P01.2 P01 2011-03-05 0 NA NA
P01.3 P01 2011-03-06 0 NA NA
P01.4 P01 2011-03-07 0 NA NA
P01.5 P01 2011-03-08 0 NA NA
P01.6 P01 2011-03-09 0 NA NA
P01.7 P01 2011-03-10 0 NA NA
P01.8 P01 2011-03-11 1 2 0.975
<<etc>>
Try this.. This will add missing Dates with proper IDs and remaining fields as 0
library(data.table)
library(plyr)
dtPostData = data.table(PostData)
minmaxTab = dtPostData[,list(minDate=min(Date),maxDate=max(Date)),by=IndID]
df = lapply(1:nrow(minmaxTab),function(x) {
temp = seq(minmaxTab$minDate[x],minmaxTab$maxDate[x],by=24*60*60)
temp = temp[!(temp %in% dtPostData[IndID == minmaxTab$IndID[x],]$Date)]
data.table(IndID = minmaxTab$IndID[x], Date = temp, Event = 0, Number = 0, Percent = 0)
})
df <- ldply(x, data.frame)
df
#Results
IndID Date Event Number Percent
1 P01 2011-03-05 0 0 0
2 P01 2011-03-06 0 0 0
3 P01 2011-03-07 0 0 0
4 P01 2011-03-08 0 0 0
5 P01 2011-03-09 0 0 0
6 P01 2011-03-10 0 0 0
7 P01 2011-03-12 0 0 0
8 P01 2011-03-16 0 0 0
9 P03 2011-07-10 0 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With