
Fill NAs with either last or next non NA value in R

I am trying to fill NA values in a column with other non-NA values within the same group in R. So my data looks something like this:

   id year pop
1  E1 2000  NA
2  E2 2000  NA
3  E2 2001  NA
4  E2 2003 120
5  E2 2005 125
6  E3 1999 115
7  E3 2001 300
8  E3 2003  NA
9  E4 2004  10
10 E4 2005  NA
11 E4 2008  NA
12 E4 2009   9
13 E5 2002  12
14 E5 2003  80

I want each NA to take either the last non-NA or the next non-NA value of pop within the same id group. The result should look something like this:

   id year pop
1  E1 2000  NA
2  E2 2000 120
3  E2 2001 120
4  E2 2003 120
5  E2 2005 125
6  E3 1999 115
7  E3 2001 300
8  E3 2003 300
9  E4 2004  10
10 E4 2005  10
11 E4 2008   9
12 E4 2009   9
13 E5 2002  12
14 E5 2003  80

I tried different things with both zoo::na.locf() and tidyr::fill(), but I keep running into two main issues:

1. I get errors because some groups contain only NA (like id == "E1" here).
2. I can only choose either the last or the next non-NA value, not both.

These are some examples of what I've tried:

df.desired <- df %>%
  group_by(id) %>%
  mutate(pop_imputated = pop) %>%
  fill(pop_imputated)

df.desired <- df %>%
  group_by(id) %>%
  mutate(pop_imputated = zoo::na.locf(pop))

Any ideas? Thanks a lot!

AntVal asked Dec 18 '22 11:12


1 Answer

Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.

Here is the code, using a spiced up version of your example:

df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"), 
                    year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L), 
                    pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L), 
                    pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)), 
               class = "data.frame", row.names = c(NA, -18L))

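library(dplyr)
library(purrr)

fill_nearest = function(x){
  # positions of the non-missing values
  keys = which(!is.na(x))
  # all-NA group: nothing to impute from, return it unchanged
  if(length(keys)==0) return(x)
  # for each position, the closest non-missing position (ties go to the earlier one)
  b = map_dbl(seq_along(x), ~keys[which.min(abs(.x-keys))])
  x[b]
}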

df %>% 
  group_by(id) %>% 
  arrange(id, year) %>%
  mutate(pop_imputated = fill_nearest(pop)) %>% 
  ungroup()
#> # A tibble: 18 x 5
#>    id     year   pop pop_exp pop_imputated
#>    <chr> <int> <int>   <int>         <int>
#>  1 E1     2000    NA      NA            NA
#>  2 E2     2000    NA     120           120
#>  3 E2     2001    NA     120           120
#>  4 E2     2003   120     120           120
#>  5 E2     2005   125     125           125
#>  6 E3     1999   115     115           115
#>  7 E3     2001   300     300           300
#>  8 E3     2003    NA     300           300
#>  9 E4     2004    10      10            10
#> 10 E4     2005    NA      10            10
#> 11 E4     2006    NA      10            10
#> 12 E4     2007    NA       9             9
#> 13 E4     2008    NA       9             9
#> 14 E4     2009     9       9             9
#> 15 E4     2018    NA       9             9
#> 16 E4     2019     8       8             8
#> 17 E5     2002    12      12            12
#> 18 E5     2003    80      80            80

Created on 2021-05-13 by the reprex package (v2.0.0)

As I had to use a purrr loop, it might get a bit slow in a huge dataset though.
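If speed becomes a concern, the same nearest-position rule can be written without a loop. The following is only a sketch in base R, not something I have benchmarked: `fill_nearest_vec` is a hypothetical name, and it assumes the same tie-breaking as above (the earlier value wins when both neighbours are equally far).

```r
# Hypothetical vectorised variant of fill_nearest(), base R only.
fill_nearest_vec <- function(x) {
  keys <- which(!is.na(x))            # positions of the non-missing values
  if (length(keys) == 0) return(x)    # all-NA group: return it unchanged
  idx <- seq_along(x)
  # number of keys at or before each position (0 if there is none)
  lo <- findInterval(idx, keys)
  prev <- keys[pmax(lo, 1L)]                  # nearest key at or before (clamped)
  nxt  <- keys[pmin(lo + 1L, length(keys))]   # nearest key after (clamped)
  # strict "<" keeps the earlier value on ties
  use_next <- abs(nxt - idx) < abs(idx - prev)
  x[ifelse(use_next, nxt, prev)]
}

# The E4 group from the example above
pop_e4 <- c(10L, NA, NA, NA, NA, 9L, NA, 8L)
fill_nearest_vec(pop_e4)
#> [1] 10 10 10  9  9  9  9  8
```

It should drop into the same pipeline, i.e. `mutate(pop_imputated = fill_nearest_vec(pop))` after `group_by(id)`.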

EDIT: I suggested adding this option to tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function that uses the year column as the reference to calculate the "distance" between values. For instance, you would probably rather have row 15 as 8 than as 9, because the year is much closer.
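To illustrate the idea of measuring distance on a reference column, here is a rough sketch. It is not the code from the GitHub issue: `fill_nearest_by` and its `ref` argument are hypothetical names, and it assumes `ref` (here the year) has no missing values.

```r
# Hypothetical variant: "nearest" is measured on a reference vector (e.g. year)
# instead of on the row position.
fill_nearest_by <- function(x, ref) {
  keys <- which(!is.na(x))            # positions of the non-missing values
  if (length(keys) == 0) return(x)    # all-NA group: return it unchanged
  # for each reference value, the position of the key whose reference is closest
  nearest <- vapply(
    ref,
    function(r) keys[which.min(abs(r - ref[keys]))],
    integer(1)
  )
  x[nearest]
}

# The E4 group from the example above
year_e4 <- c(2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L)
pop_e4  <- c(10L, NA, NA, NA, NA, 9L, NA, 8L)
fill_nearest_by(pop_e4, year_e4)
#> [1] 10 10 10  9  9  9  8  8
```

In the grouped pipeline this would be called as `mutate(pop_imputated = fill_nearest_by(pop, year))`. The 2018 row now gets 8 (one year away) rather than 9 (nine years away).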

Dan Chaltiel answered Dec 20 '22 00:12