Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - estimating missing values

Tags:

r

data-quality

Let's assume I have a table as such:

Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  32000
09/13/2017  8000

Here is what the data in the table looks like

The values in the table are estimated by an R program which I have no access to (it's a black box right now). Now there are a few days with 0 values which tend to creep in due to issues in our ingestion/ETL process. I need to estimate the values for the dates with 0 data.

Our approach is to:

  • Draw a line from the date prior to the missing data to the date right after the missing data
  • Estimate the value for the missing date from the line

Now if there's only one day with missing data between two good days, a straightforward mean would work. If there are two or more consecutive days with missing data, the mean would not work, so I'm trying to formulate a way to go about estimating values for multiple data points.

The intersection of the green and red lines would give the required values

Would this approach work in R? I'm a total n00b at R so I'm not really sure if this is even feasible.

like image 931
Craig Avatar asked Dec 23 '22 14:12

Craig


2 Answers

You can fill in the values with linear interpolation using the function approxfun.

## Your data
df = read.table(text="Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  32000
09/13/2017  8000",
header=TRUE, stringsAsFactors=FALSE)
df$Date = as.Date(df$Date, format="%m/%d/%Y")


## Create function for linear interpolation
Interp = approxfun(df[df$Sales > 0, ])

## Use function to fill in interpolated values
Vals = Interp(df$Date[df$Sales == 0])
df$Sales[df$Sales == 0] = Vals
plot(df, type="l")
grid()

Interpolated values

like image 157
G5W Avatar answered Jan 12 '23 10:01

G5W


We can also use the na.interpolation function from the imputeTS package. The default method of na.interpolation is linear interpolation, but we can also specify other methods if we want.

library(dplyr)
library(imputeTS)

dt2 <- dt %>%
  replace(. == 0, NA) %>%
  mutate(Sales = na.interpolation(Sales))

dt2
         Date Sales
1  09/01/2017  9000
2  09/02/2017 12000
3  09/03/2017 11500
4  09/04/2017 11000
5  09/05/2017 14400
6  09/06/2017 16600
7  09/07/2017 18800
8  09/08/2017 21000
9  09/09/2017 15000
10 09/10/2017 23100
11 09/11/2017 27550
12 09/12/2017 32000
13 09/13/2017  8000

Data

dt <- read.table(text = "Date        Sales
09/01/2017  9000
                 09/02/2017  12000
                 09/03/2017  0
                 09/04/2017  11000
                 09/05/2017  14400
                 09/06/2017  0
                 09/07/2017  0
                 09/08/2017  21000
                 09/09/2017  15000
                 09/10/2017  23100
                 09/11/2017  0
                 09/12/2017  32000
                 09/13/2017  8000",
                 header = TRUE, stringsAsFactors = FALSE)
like image 42
www Avatar answered Jan 12 '23 11:01

www