Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Linear Interpolation using dplyr

Tags:

r

dplyr

xts

I'm trying to use the na.approx() function from the zoo library (in conjunction with xts) to interpolate missing values from repeated measures data for multiple individuals with multiple measurements.

Sample data...

event.date <- c("2010-05-25", "2010-09-10", "2011-05-13", "2012-03-28", "2013-03-07",    
                "2014-02-13", "2010-06-11", "2010-09-10", "2011-05-13", "2012-03-28",
                "2013-03-07", "2014-02-13")
variable   <- c("neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd",
                "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd")
value      <- c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420, 1.0520, 1.0665, 1.0760,
                1.0870, NA, 1.0550)
## Bind into a data frame
df <- data.frame(event.date, variable, value)
rm(event.date, variable, value)
## Convert date
df$event.date <- as.Date(df$event.date)
## Load libraries
library(magrittr)
library(xts)
library(zoo)

I can interpolate one missing data point for a single outcome for a given person using xts() and na.approx()....

## Subset one variable
wbody <- subset(df, variable == "wbody.bmd")
## order/index and then interpolate
xts(wbody$value, wbody$event.date) %>%
  na.approx()
2010-06-11 1.052000
2010-09-10 1.066500
2011-05-13 1.076000
2012-03-28 1.087000
2013-03-07 1.070977
2014-02-13 1.055000

Not ideal having a matrix returned, but I can work around that. The main problem I have though is that I've multiple outcomes for multiple people. I, perhaps naively thought that since this is therefore a split-apply-combine problem that I could utilise dplyr to achieve this in the following manner...

## Load library
library(dplyr)
## group and then arrange the data (to ensure dates are correct)
df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
      xts(.$value, .$event.date) %>%
        na.approx()

Error in xts(., .$value, .$event.date) : order.by requires an appropriate time-based object

It seems that dplyr doesn't play well with xts/zoo and I've spent a couple of hours searching around trying to find tutorials/examples on how to interpolate missing data points in R, but all I've found are single case examples and so far I've been unable to find anything on how to do this for multiple sites for multiple people (I realise I could make it just a multiple people problem by reshaping my data to wide but that still wouldn't solve the problem I'm encountering).

Any thoughts/advice/insights on how to proceed would be greatly appreciated.

Thanks

EDIT : Clarification that some functions come from zoo package.

like image 928
slackline Avatar asked Jan 13 '15 11:01

slackline


2 Answers

Use the approx() function for linear-interpolation:

df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
    mutate(time=seq(1,n())) %>%
      mutate(ip.value=approx(time,value,time)$y) %>%
      select(-time)

or the spline function for non-linear interpolation:

df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
    mutate(time=seq(1,n())) %>%
      mutate(ip.value=spline(time,value ,n=n())$y) %>%
      select(-time)
like image 191
sumtxt Avatar answered Oct 06 '22 19:10

sumtxt


The solution I've gone with is based on the first comment from @docendodiscimus

Rather than attempt to create a new data frame as I'd been doing this approach simply adds columns to the existing data frame by taking advantage of dplyr's mutate() function.

My code is now...

df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
      mutate(ip.value = na.approx(value, maxgap = 4, rule = 2))

The maxgap allows upto four consecutive NA's, whilst the rule option allows extrapolation into the flanking time points.

like image 43
slackline Avatar answered Oct 06 '22 19:10

slackline