edit
The question was originally asked for data.table
. A solution with any package would be interesting.
I am a little stuck with a particular variation of a more general problem. I have panel data that I am using with data.table and I would like to fill in some missing values using the group by functionality of data.table. Unfortunately they are not numeric, so I can't simply interpolate, but they should only be filled in based on a condition. Is it possible to perform a kind of conditional na.locf in data.tables?
Essentially I only want to fill in the NAs if after the NAs the next observation is the previous ones, though the more general question is how to conditionally fill in NAs.
For example, in the following data I would like to fill in the associatedid variable by each id group. So id==1
, year==2003
would fill in as ABC123
because its the value before and after the NA, but not 2000 for the same id. id== 2
would not be changed because the next value is not the same as the one prior to the NAs. id==3
would fill in for 2003 and 2004.
mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 <NA>
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 <NA>
#> 17 3 2004 <NA>
#> 18 3 2005 ABC123
dt = data.table(mydf, key = c("id"))
desired output
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Here's a pure tidyverse solution :
library(tidyverse)
mydf %>%
mutate(up = associatedid, down = associatedid) %>%
group_by(id) %>%
fill(up,.direction = "up") %>%
fill(down) %>%
mutate_at("associatedid", ~if_else(is.na(.) & up == down, up, .)) %>%
ungroup() %>%
select(-up, - down)
#> # A tibble: 18 x 3
#> id year associatedid
#> <int> <int> <fct>
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Or using zoo::na.locf
:
library(dplyr)
library(zoo)
mydf %>%
group_by(id) %>%
mutate_at("associatedid", ~if_else(
is.na(.) & na.locf(.,F) == na.locf(.,F,fromLast = TRUE), na.locf(.,F), .)) %>%
ungroup()
#> # A tibble: 18 x 3
#> id year associatedid
#> <int> <int> <fct>
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
The same idea but using data.table :
library(zoo)
library(data.table)
setDT(mydf)
mydf[,associatedid := fifelse(
is.na(associatedid) & na.locf(associatedid,F) == na.locf(associatedid,F,fromLast = TRUE),
na.locf(associatedid,F), associatedid),
by = id]
mydf
#> id year associatedid
#> 1: 1 2000 <NA>
#> 2: 1 2001 ABC123
#> 3: 1 2002 ABC123
#> 4: 1 2003 ABC123
#> 5: 1 2004 ABC123
#> 6: 1 2005 ABC123
#> 7: 2 2000 <NA>
#> 8: 2 2001 ABC123
#> 9: 2 2002 ABC123
#> 10: 2 2003 <NA>
#> 11: 2 2004 DEF456
#> 12: 2 2005 DEF456
#> 13: 3 2000 <NA>
#> 14: 3 2001 ABC123
#> 15: 3 2002 ABC123
#> 16: 3 2003 ABC123
#> 17: 3 2004 ABC123
#> 18: 3 2005 ABC123
And finally a fun idea using base, noting that you want to interpolate only if constant interpolation and linear interpolation are the same, if this character variable was numeric :
i <- ave( as.numeric(factor(mydf$associatedid)), mydf$id,FUN = function(x) ifelse(
approx(x,xout = seq_along(x))$y == (z<- approx(x,xout = seq_along(x),method = "constant")$y),
z, x))
mydf$associatedid <- levels(mydf$associatedid)[i]
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.
new.locf <- function(x){
# might want to think about the end of this loop
# this works here but you might need to add another case
# if there are NA's as the last value.
#
# anyway, loop through observations in a vector, x.
for(i in 2:(length(x)-1)){
nextval = i
# find the next, non-NA value
# again, not tested but might break if there isn't one?
while(nextval <= length(x)-1 & is.na(x[nextval])){
nextval = nextval + 1
}
# if the current value is not NA, great!
if(!is.na(x[i])){
x[i] <- x[i]
}else{
# if the current value is NA, and the last value is a value
# (should given the nature of this loop), and
# the next value, as calculated above, is the same as the last
# value, then give us that value.
if(is.na(x[i]) & !is.na(x[i-1]) & x[i-1] == x[nextval]){
x[i] <- x[nextval]
}else{
# finally, return NA if neither of these conditions hold
x[i] <- NA
}
}
}
# return the new vector
return(x)
}
Once we have that function, we can use data.table as usual:
dt2 <- dt[,list(year = year,
# when I read your data in, associatedid read as factor
associatedid = new.locf(as.character(associatedid))
),
by = "id"
]
This returns:
> dt2
id year associatedid
1: 1 2000 NA
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 NA
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 NA
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 NA
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
which is what you are looking for as best I understand it.
I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.
If na.locf0
applied forward and backwards are the same then use na.locf0
; otherwise, if they are not equal or if either is NA then use NA.
library(data.table)
library(zoo)
dt[, associatedid :=
ifelse(na.locf0(associatedid) == na.locf0(associatedid, fromLast=TRUE),
na.locf0(associatedid), NA), by = id]
giving:
> dt
id year associatedid
1: 1 2000 <NA>
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 <NA>
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 <NA>
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 <NA>
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With