I am trying to fill NA values in a column with other non-NA values within the same group in R. So my data looks something like this:
df
id year pop
1 E1 2000 NA
2 E2 2000 NA
3 E2 2001 NA
4 E2 2003 120
5 E2 2005 125
6 E3 1999 115
7 E3 2001 300
8 E3 2003 NA
9 E4 2004 10
10 E4 2005 NA
11 E4 2008 NA
12 E4 2009 9
13 E5 2002 12
14 E5 2003 80
And I want NA values to have either the last non-NA or the next non-NA value of pop within the same group of id, to look something like this:
df.desired
id year pop
1 E1 2000 NA
2 E2 2000 120
3 E2 2001 120
4 E2 2003 120
5 E2 2005 125
6 E3 1999 115
7 E3 2001 300
8 E3 2003 300
9 E4 2004 10
10 E4 2005 10
11 E4 2008 9
12 E4 2009 9
13 E5 2002 12
14 E5 2003 80
I tried different things with both zoo::na.locf() and tidyr::fill(), but I keep having two main issues: 1. I get errors because entire groups only have NA (like id == "E1" here) and 2. I can only choose either the last or the next non-NA value.
These are some examples of what I've tried:
library(tidyverse)
library(zoo)
df.desired <- df %>%
group_by(id) %>%
arrange(year)%>%
mutate(pop_imputated = pop)%>%
fill(pop_imputated)%>%
ungroup()
df.desired <- df %>%
group_by(id) %>%
arrange(year)%>%
mutate(pop_imputated = zoo::na.locf(pop))%>%
fill(pop_imputated)%>%
ungroup()
Any ideas? Thanks a lot!
Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.
Here is the code, using a spiced up version of your example:
library(tidyverse)
df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"),
year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L),
pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L),
pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)),
class = "data.frame", row.names = c(NA, -18L))
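fill_nearest = function(x){
  keys = which(!is.na(x))           # positions of the non-missing values
  if(length(keys)==0) return(x)     # all-NA group: nothing to impute from
  # for each position, pick the nearest non-missing position (ties go to the earlier one);
  # seq_along() is used because seq.int(x) expands to 1:x when x has a single element
  b = map_dbl(seq_along(x), ~keys[which.min(abs(.x-keys))])
  x[b]
}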
df %>%
group_by(id) %>%
arrange(id, year) %>%
mutate(pop_imputated = fill_nearest(pop)) %>%
ungroup()
#> # A tibble: 18 x 5
#> id year pop pop_exp pop_imputated
#> <chr> <int> <int> <int> <int>
#> 1 E1 2000 NA NA NA
#> 2 E2 2000 NA 120 120
#> 3 E2 2001 NA 120 120
#> 4 E2 2003 120 120 120
#> 5 E2 2005 125 125 125
#> 6 E3 1999 115 115 115
#> 7 E3 2001 300 300 300
#> 8 E3 2003 NA 300 300
#> 9 E4 2004 10 10 10
#> 10 E4 2005 NA 10 10
#> 11 E4 2006 NA 10 10
#> 12 E4 2007 NA 9 9
#> 13 E4 2008 NA 9 9
#> 14 E4 2009 9 9 9
#> 15 E4 2018 NA 9 9
#> 16 E4 2019 8 8 8
#> 17 E5 2002 12 12 12
#> 18 E5 2003 80 80 80
Created on 2021-05-13 by the reprex package (v2.0.0)
As I had to use a purrr loop, it might get a bit slow on a huge dataset, though.
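If the loop does turn out to be a bottleneck, the same nearest-position lookup can be sketched without purrr. This is only an illustration, not part of the original answer: the name fill_nearest_vec is made up, and it assumes the rows are already sorted within each group. It uses base R's findInterval() to find the non-missing neighbours on each side and keeps the closer one.

```r
# Hypothetical vectorized variant of fill_nearest(); base R only.
fill_nearest_vec <- function(x) {
  keys <- which(!is.na(x))             # positions of the non-missing values
  if (length(keys) == 0) return(x)     # all-NA group: nothing to impute from
  pos <- seq_along(x)
  lo <- findInterval(pos, keys)        # index of the last key at or before each position (0 if none)
  left  <- keys[pmax(lo, 1)]           # closest key on the left (or the first key)
  right <- keys[pmin(lo + 1, length(keys))]  # closest key on the right (or the last key)
  # ties go to the left (earlier) key, like which.min() in the loop version
  nearest <- ifelse(abs(pos - left) <= abs(right - pos), left, right)
  x[nearest]
}

# The E4 group from the example above
fill_nearest_vec(c(10L, NA, NA, NA, NA, 9L, NA, 8L))
#> [1] 10 10 10  9  9  9  9  8
```

It should drop into the same mutate() call in place of fill_nearest().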
EDIT: I suggested adding this option to tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function that uses the year column as the reference to calculate the "distance" between the values. For instance, you would rather have row 15 as 8 than as 9 because the year is much closer.
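To make the year-based idea concrete, here is a minimal sketch. It is not the version posted in the GitHub issue: the name fill_nearest_by and its ref argument are assumptions made up for this illustration, and it assumes ref (the year) has no missing values. Distance is measured on ref instead of on row position.

```r
# Hypothetical helper: impute each value from the observation whose
# reference value (e.g. year) is closest. Assumes `ref` contains no NA.
fill_nearest_by <- function(x, ref) {
  keys <- which(!is.na(x))             # positions of the non-missing values
  if (length(keys) == 0) return(x)     # all-NA group: nothing to impute from
  # for each row, pick the key whose reference value is closest (ties go to the earlier key)
  nearest <- vapply(ref, function(r) keys[which.min(abs(r - ref[keys]))], integer(1))
  x[nearest]
}

# The E4 group from the example above: 2018 is now filled from 2019, not 2009
pop  <- c(10L, NA, NA, NA, NA, 9L, NA, 8L)
year <- c(2004:2009, 2018L, 2019L)
fill_nearest_by(pop, year)
#> [1] 10 10 10  9  9  9  8  8
```

Within the pipeline above, this would be called as mutate(pop_imputated = fill_nearest_by(pop, year)) after group_by(id).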