Let's say I have a data frame containing quarterly sales, where the values for the later quarters are missing. I would like to replace the NAs using a simple formula: 1.05 times the value from four quarters earlier (with mutate/dplyr, as below). The issue is that each mutate call only fills one more year, because lag(Sales, 4) is still NA for the years after that, so I have to repeat it many times. How could I fill all the NAs at the same time? Is there a way?
test <- structure(list(Period = c("1999Q1", "1999Q2", "1999Q3", "1999Q4",
"2000Q1", "2000Q2", "2000Q3", "2000Q4", "2001Q1", "2001Q2", "2001Q3",
"2001Q4", "2002Q1", "2002Q2", "2002Q3", "2002Q4", "2003Q1", "2003Q2",
"2003Q3", "2003Q4"), Sales= c(353.2925571, 425.9299841, 357.5204626,
363.80247, 302.8081066, 394.328576, 435.15573, 387.99768, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
library(dplyr)

test %>%
  mutate(Sales = ifelse(is.na(Sales), 1.05 * lag(Sales, 4), Sales)) %>%
  mutate(Sales = ifelse(is.na(Sales), 1.05 * lag(Sales, 4), Sales)) %>%
  mutate(Sales = ifelse(is.na(Sales), 1.05 * lag(Sales, 4), Sales))
More generally, to replace missing values you first identify the NAs with is.na() (selecting the column with the $ operator) and then assign a replacement value, for example the column minimum via min(x, na.rm = TRUE).
The mean is a reasonable replacement when the data distribution is roughly symmetric; consider the median or mode for skewed data. In Python, the pandas DataFrame method fillna serves the same purpose.
In mean substitution, the mean of a variable's observed values is used in place of each missing value of that same variable.
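As a rough illustration of mean and median substitution in R, here is a minimal sketch. The vector `sales` is a hypothetical toy example, not the data from the question, and this kind of constant fill would not reproduce the 5% growth rule the question asks for.

```r
# Hypothetical toy vector (an assumption for illustration only)
sales <- c(10, 20, NA, 40, NA)

# Mean substitution: every NA gets the mean of the observed values
sales_mean <- ifelse(is.na(sales), mean(sales, na.rm = TRUE), sales)

# Median substitution: often a safer choice when the data are skewed
sales_median <- ifelse(is.na(sales), median(sales, na.rm = TRUE), sales)
```

Both results keep the observed values untouched and only overwrite the positions where is.na() is TRUE.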
One dplyr and tidyr possibility could be:
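library(dplyr)
library(tidyr)

test %>%
  # within each quarter, fill NAs with the last observed value for that quarter
  group_by(quarter = substr(Period, 5, 6)) %>%
  mutate(Sales_temp = replace_na(Sales, last(na.omit(Sales)))) %>%
  # within the NA rows of each quarter, compound the 5% growth year by year
  group_by(quarter, na = is.na(Sales)) %>%
  mutate(constant = 1.05,
         Sales_temp = Sales_temp * cumprod(constant),
         Sales = coalesce(Sales, Sales_temp)) %>%
  ungroup() %>%
  select(1:2)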
Period Sales
<chr> <dbl>
1 1999Q1 353.
2 1999Q2 426.
3 1999Q3 358.
4 1999Q4 364.
5 2000Q1 303.
6 2000Q2 394.
7 2000Q3 435.
8 2000Q4 388.
9 2001Q1 318.
10 2001Q2 414.
11 2001Q3 457.
12 2001Q4 407.
13 2002Q1 334.
14 2002Q2 435.
15 2002Q3 480.
16 2002Q4 428.
17 2003Q1 351.
18 2003Q2 456.
19 2003Q3 504.
20 2003Q4 449.
Or with just dplyr:
library(dplyr)

test %>%
  group_by(quarter = substr(Period, 5, 6)) %>%
  mutate(Sales_temp = if_else(is.na(Sales), last(na.omit(Sales)), Sales)) %>%
  group_by(quarter, na = is.na(Sales)) %>%
  mutate(constant = 1.05,
         Sales_temp = Sales_temp * cumprod(constant),
         Sales = coalesce(Sales, Sales_temp)) %>%
  ungroup() %>%
  select(1:2)
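A base R approach that fills every NA in one vectorized step:
x <- test$Sales
# find the index of the last non-NA value
last.valid <- tail(which(!is.na(x)), 1)
# indices of the last four observed quarters, ordered by quarter position
base <- ceiling(last.valid / 4) * 4 + (-3:0)
base <- base + ifelse(base > last.valid, -4, 0)
# store the "bases": the last observed value for each quarter
bases <- x[base]
# calculate the "exponents": number of years past the last observed value
expos <- ceiling((seq_along(x) - last.valid) / 4)
# bases (length 4) is recycled along x, so each row picks up its own quarter
test$Sales <- ifelse(is.na(x), bases * 1.05^expos, x)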
tail(test)
# Period Sales
# 15 2002Q3 479.7592
# 16 2002Q4 427.7674
# 17 2003Q1 350.5382
# 18 2003Q2 456.4846
# 19 2003Q3 503.7472
# 20 2003Q4 449.1558
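The rule behind all of these answers is the recurrence Sales[i] = 1.05 * Sales[i - 4]. Writing it as an explicit loop is one simple way to cross-check the vectorized results. The following is only a sketch: the vector is rebuilt from the values in the question, and the names `sales` and `filled` are chosen here for illustration.

```r
# Rebuild the Sales column from the question (8 observed values, 12 NAs)
sales <- c(353.2925571, 425.9299841, 357.5204626, 363.80247,
           302.8081066, 394.328576, 435.15573, 387.99768,
           rep(NA_real_, 12))

filled <- sales
# Walk forward so that each newly filled value is available four rows later
for (i in seq_along(filled)) {
  if (i > 4 && is.na(filled[i])) {
    filled[i] <- 1.05 * filled[i - 4]
  }
}

# Last six quarters, rounded for comparison with the tail(test) output
round(tail(filled, 6), 4)
```

The loop is slower than the vectorized versions on long series, but it mirrors the formula directly and does not depend on where the last observed value falls within the year.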