Let's assume I have a table as such:
Date Sales
09/01/2017 9000
09/02/2017 12000
09/03/2017 0
09/04/2017 11000
09/05/2017 14400
09/06/2017 0
09/07/2017 0
09/08/2017 21000
09/09/2017 15000
09/10/2017 23100
09/11/2017 0
09/12/2017 32000
09/13/2017 8000
The values in the table are estimated by an R program which I have no access to (it's a black box right now). Now there are a few days with 0 values which tend to creep in due to issues in our ingestion/ETL process. I need to estimate the values for the dates with 0 data.
Our approach is to:
Now if there's only one day with missing data between two good days, a straightforward mean would work. If there are two or more consecutive days with missing data, the mean would not work, so I'm trying to formulate a way to go about estimating values for multiple data points.
Would this approach work in R? I'm a total n00b at R so I'm not really sure if this is even feasible.
You can fill in the values with linear interpolation using the function approxfun
.
## Your data
df = read.table(text="Date Sales
09/01/2017 9000
09/02/2017 12000
09/03/2017 0
09/04/2017 11000
09/05/2017 14400
09/06/2017 0
09/07/2017 0
09/08/2017 21000
09/09/2017 15000
09/10/2017 23100
09/11/2017 0
09/12/2017 32000
09/13/2017 8000",
header=TRUE, stringsAsFactors=FALSE)
df$Date = as.Date(df$Date, format="%m/%d/%Y")
## Create function for linear interpolation
Interp = approxfun(df[df$Sales > 0, ])
## Use function to fill in interpolated values
Vals = Interp(df$Date[df$Sales == 0])
df$Sales[df$Sales == 0] = Vals
plot(df, type="l")
grid()
We can also use the na.interpolation
function from the imputeTS
package. The default method of na.interpolation
is linear interpolation, but we can also specify other methods if we want.
library(dplyr)
library(imputeTS)
dt2 <- dt %>%
replace(. == 0, NA) %>%
mutate(Sales = na.interpolation(Sales))
dt2
Date Sales
1 09/01/2017 9000
2 09/02/2017 12000
3 09/03/2017 11500
4 09/04/2017 11000
5 09/05/2017 14400
6 09/06/2017 16600
7 09/07/2017 18800
8 09/08/2017 21000
9 09/09/2017 15000
10 09/10/2017 23100
11 09/11/2017 27550
12 09/12/2017 32000
13 09/13/2017 8000
Data
dt <- read.table(text = "Date Sales
09/01/2017 9000
09/02/2017 12000
09/03/2017 0
09/04/2017 11000
09/05/2017 14400
09/06/2017 0
09/07/2017 0
09/08/2017 21000
09/09/2017 15000
09/10/2017 23100
09/11/2017 0
09/12/2017 32000
09/13/2017 8000",
header = TRUE, stringsAsFactors = FALSE)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With