Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fill gaps in a timeseries with averages

Tags:

r

time-series

I have a dataframe like so:

day         sum_flux  samples mean
2005-10-26     0.02     48    0.02
2005-10-27     0.12     12    0.50

It's a series of daily readings spanning 5 years, however some of the days are missing. I want to fill these days with the average of that month from other years.

i.e if 26-10-2005 was missing I'd want to use the average of all Octobers in the data set. if all of October was missing I'd want to apply this average to each missing day.

I think I need to build a function (possibly using plyr) to evaluate the days. However I'm very inexperienced with using the various timeseries objects in R, and conditionally subsetting data and would like some advice. Especially regarding which type of timeseries I should be using.

Many Thanks

like image 223
BetaScoo8 Avatar asked Feb 23 '23 01:02

BetaScoo8


2 Answers

Some sample data. I'm assuming that sum_flux is the column that has missing values, and that you want to calculate values for.

library(lubridate)
days <- seq.POSIXt(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
  day      = days,
  sum_flux = runif(n_days),
  samples  = sample(100, n_days, replace = TRUE),
  mean     = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA

Add a month column.

readings$month <- month(readings$day, label = TRUE)

Use tapply to get the monthly mean flux.

monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))

Use this value whenever the flux is missing, or keep the flux if not.

readings$sum_flux2 <- with(readings, ifelse(
  is.na(sum_flux), 
  monthly_avg_flux[month], 
  sum_flux
))
like image 122
Richie Cotton Avatar answered Feb 26 '23 21:02

Richie Cotton


This is one (very fast) way in data.table.

Using the nice example data from Richie :

require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
    day      = days,
    sum_flux = runif(n_days),
    samples  = sample(100, n_days, replace = TRUE),
    mean     = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
             day   sum_flux samples       mean
 [1,] 2005-10-26 0.32838686      94 0.09647325
 [2,] 2005-10-27 0.14686591      88 0.48728321
 [3,] 2005-10-28 0.25800913      51 0.72776002
 [4,] 2005-10-29 0.09628937      81 0.80954124
 [5,] 2005-10-30 0.70721591      23 0.60165240
 [6,] 2005-10-31 0.59555079       2 0.96849533
 [7,] 2005-11-01         NA      42 0.37566491
 [8,] 2005-11-02 0.01649860      89 0.48866220
 [9,] 2005-11-03 0.46802818      49 0.28920807
[10,] 2005-11-04 0.13024856      30 0.29051080
First 10 rows of 1827 printed.

Create the average for each month, in appearance order of each group :

> avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
> avg
      mnth        V1
 [1,]   10 0.4915999
 [2,]   11 0.5107873
 [3,]   12 0.4451787
 [4,]    1 0.4966040
 [5,]    2 0.4972244
 [6,]    3 0.4952821
 [7,]    4 0.5106539
 [8,]    5 0.4717122
 [9,]    6 0.5110490
[10,]    7 0.4507383
[11,]    8 0.4680827
[12,]    9 0.5150618

Next reorder avg to start in January :

avg = avg[order(mnth)]
avg
      mnth        V1
 [1,]    1 0.4966040
 [2,]    2 0.4972244
 [3,]    3 0.4952821
 [4,]    4 0.5106539
 [5,]    5 0.4717122
 [6,]    6 0.5110490
 [7,]    7 0.4507383
 [8,]    8 0.4680827
 [9,]    9 0.5150618
[10,]   10 0.4915999
[11,]   11 0.5107873
[12,]   12 0.4451787

Now update by reference (:=) the sum_flux column, where sum_flux is NA, with the value from avg for that month.

readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
             day   sum_flux samples       mean
 [1,] 2005-10-26 0.32838686      94 0.09647325
 [2,] 2005-10-27 0.14686591      88 0.48728321
 [3,] 2005-10-28 0.25800913      51 0.72776002
 [4,] 2005-10-29 0.09628937      81 0.80954124
 [5,] 2005-10-30 0.70721591      23 0.60165240
 [6,] 2005-10-31 0.59555079       2 0.96849533
 [7,] 2005-11-01 0.51078729**    42 0.37566491  # ** updated with the Nov avg
 [8,] 2005-11-02 0.01649860      89 0.48866220
 [9,] 2005-11-03 0.46802818      49 0.28920807
[10,] 2005-11-04 0.13024856      30 0.29051080
First 10 rows of 1827 printed. 

Done.

like image 35
Matt Dowle Avatar answered Feb 26 '23 20:02

Matt Dowle