I'm trying to use the na.approx() function from the zoo library (in conjunction with xts) to interpolate missing values from repeated measures data for multiple individuals with multiple measurements.
Sample data...
event.date <- c("2010-05-25", "2010-09-10", "2011-05-13", "2012-03-28", "2013-03-07",
"2014-02-13", "2010-06-11", "2010-09-10", "2011-05-13", "2012-03-28",
"2013-03-07", "2014-02-13")
variable <- c("neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd",
"wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd")
value <- c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420, 1.0520, 1.0665, 1.0760,
1.0870, NA, 1.0550)
## Bind into a data frame
df <- data.frame(event.date, variable, value)
rm(event.date, variable, value)
## Convert date
df$event.date <- as.Date(df$event.date)
## Load libraries
library(magrittr)
library(xts)
library(zoo)
I can interpolate one missing data point for a single outcome for a given person using xts() and na.approx()...
## Subset one variable
wbody <- subset(df, variable == "wbody.bmd")
## order/index and then interpolate
xts(wbody$value, wbody$event.date) %>%
na.approx()
2010-06-11 1.052000
2010-09-10 1.066500
2011-05-13 1.076000
2012-03-28 1.087000
2013-03-07 1.070977
2014-02-13 1.055000
Having a matrix returned is not ideal, but I can work around that. The main problem, though, is that I have multiple outcomes for multiple people. I thought, perhaps naively, that since this is a split-apply-combine problem I could use dplyr to achieve this in the following manner...
## Load library
library(dplyr)
## group and then arrange the data (to ensure dates are correct)
df %>%
group_by(variable) %>%
arrange(variable, event.date) %>%
xts(.$value, .$event.date) %>%
na.approx()
Error in xts(., .$value, .$event.date) : order.by requires an appropriate time-based object
It seems that dplyr doesn't play well with xts/zoo. I've spent a couple of hours searching for tutorials/examples on how to interpolate missing data points in R, but all I've found are single-case examples, and so far I've been unable to find anything on how to do this for multiple sites for multiple people. (I realise I could make it just a multiple-people problem by reshaping my data to wide, but that still wouldn't solve the problem I'm encountering.)
Any thoughts/advice/insights on how to proceed would be greatly appreciated.
Thanks
EDIT: Clarification that some functions come from the zoo package.
Use the approx() function for linear interpolation:
df %>%
group_by(variable) %>%
arrange(variable, event.date) %>%
mutate(time=seq(1,n())) %>%
mutate(ip.value=approx(time,value,time)$y) %>%
select(-time)
or the spline() function for non-linear interpolation:
df %>%
group_by(variable) %>%
arrange(variable, event.date) %>%
mutate(time=seq(1,n())) %>%
mutate(ip.value=spline(time,value,xout=time)$y) %>%  # xout keeps results aligned with the rows even if an end value is NA
select(-time)
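Note that both snippets above interpolate against the row index (1, 2, 3, ...), so the measurements are treated as equally spaced. If the real spacing between visits matters, one option is to interpolate against the dates themselves. The following is a minimal base R sketch under that assumption, using the sample data from the question; interp_by_date is a hypothetical helper name introduced only for illustration, not part of any package.

```r
# Sample data from the question
df <- data.frame(
  event.date = as.Date(c("2010-05-25", "2010-09-10", "2011-05-13",
                         "2012-03-28", "2013-03-07", "2014-02-13",
                         "2010-06-11", "2010-09-10", "2011-05-13",
                         "2012-03-28", "2013-03-07", "2014-02-13")),
  variable   = rep(c("neck.bmd", "wbody.bmd"), each = 6),
  value      = c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420,
                 1.0520, 1.0665, 1.0760, 1.0870, NA, 1.0550)
)

# Hypothetical helper: linear interpolation of y against the dates.
# approx() ignores the NA observations and evaluates at every date.
interp_by_date <- function(dates, y) {
  approx(x = as.numeric(dates), y = y, xout = as.numeric(dates))$y
}

# Sort, then interpolate separately within each variable
df <- df[order(df$variable, df$event.date), ]
df$ip.value <- NA_real_
for (v in unique(df$variable)) {
  rows <- df$variable == v
  df$ip.value[rows] <- interp_by_date(df$event.date[rows], df$value[rows])
}

df
```

For wbody.bmd this gives about 1.070977 for the 2013-03-07 gap, which matches the xts()/na.approx() result shown in the question, whereas interpolating against the row index gives the plain midpoint of the two neighbours (1.071).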
The solution I've gone with is based on the first comment from @docendodiscimus. Rather than attempting to create a new data frame, as I'd been doing, this approach simply adds columns to the existing data frame by taking advantage of dplyr's mutate() function.
My code is now...
df %>%
group_by(variable) %>%
arrange(variable, event.date) %>%
mutate(ip.value = na.approx(value, maxgap = 4, rule = 2))
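The maxgap argument allows up to four consecutive NAs to be filled, whilst the rule option allows extrapolation into the flanking time points (with rule = 2, the nearest observed value is carried outwards to any leading or trailing NAs).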
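To see what the rule option does in isolation, here is a small sketch using base R's approx(), which is the function na.approx() forwards extra arguments such as rule to. The toy vector is made up purely for illustration and is not from the data above.

```r
# Toy series with a leading NA, an interior NA and a trailing NA
x <- 1:5
y <- c(NA, 1.0, NA, 2.0, NA)
obs <- !is.na(y)

# rule = 1 (the default): points outside the observed range stay NA
r1 <- approx(x[obs], y[obs], xout = x, rule = 1)$y

# rule = 2: the nearest observed value is used outside the observed range
r2 <- approx(x[obs], y[obs], xout = x, rule = 2)$y

r1  # NA 1.0 1.5 2.0 NA
r2  # 1.0 1.0 1.5 2.0 2.0
```

So rule = 2 does not extend the trend beyond the data; it holds the first and last observed values constant, which is worth keeping in mind if the series starts or ends with missing measurements.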