Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to replace NA with most recent non-NA by group? [duplicate]

Tags:

r

I have a DF of individuals with some incomplete and repeated characteristics as following:

    name <- c("A", "A", "B", "B", "B", "C", "D", "D")
    age <- c(28,NA,NA,NA,NA,NA,53,NA)
    birthplace <- c("city1",NA, "city2",NA,NA,NA,NA,NA)
    value <- 100:107
    df <- data.frame(name,age,birthplace,value)

    name age birthplace value
1    A  28      city1   100
2    A  NA       <NA>   101
3    B  NA      city2   102
4    B  NA       <NA>   103
5    B  NA       <NA>   104
6    C  NA       <NA>   105
7    D  53       <NA>   106
8    D  NA       <NA>   107

Since the value is unique for row. I want complete each row with aviable person's detail like this:

       name age birthplace value
    1    A  28      city1   100
    2    A  28      city1   101
    3    B  NA      city2   102
    4    B  NA      city2   103
    5    B  NA      city2   104
    6    C  NA       <NA>   105
    7    D  53       <NA>   106
    8    D  53       <NA>   107

I tried to use

library(zoo)
library(dplyr)
df <- df %>% group_by(name) %>% na.locf(na.rm=F)

But it does't work very well. Any idea for implement function by group?

like image 531
Lingyu Kong Avatar asked Dec 02 '22 12:12

Lingyu Kong


2 Answers

As another base R solution, here is a poor man's na.locf

fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

To fill down by group, the approach is to use tapply() to split and apply to each group, and split<- to combine groups to the original geometry, as

fill_down_by_group <- function(v, grp) {
    ## original 'by hand':
    ##     split(v, grp) <- tapply(v, grp, fill_down)
    ##     v
    ## done by built-in function `ave()`
    ave(v, grp, FUN=fill_down)
}

To process multiple columns, one might

elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)

Notes

  1. I would be interested in seeing how a dplyr solution handles many columns, without hard-coding each? Answering my own question, I guess this is

    library(dplyr); library(tidyr)
    df %>% group_by(name) %>% fill_(elts)
    
  2. A more efficient base solution when the groups are already 'grouped' (e.g., identical(grp, sort(grp))) is

    fill_down_by_grouped <- function(v, grp) {
        if (length(v) > 1) {
            keep <- !(duplicated(v) & is.na(v))
            v[keep][cumsum(keep)]
        } else v
    }
    
  3. For me, fill_down() on a vector with about 10M elements takes ~225ms; fill_down_by_grouped() takes ~300ms independent of the number of groups; fill_down_by_group() scales with the number of groups; for 10000 groups ~2s, 10M groups about 36s

like image 129
Martin Morgan Avatar answered Dec 15 '22 09:12

Martin Morgan


Could also be:

library(dplyr)
library(tidyr)
df %>% group_by(name) %>% fill(age, birthplace)

# Source: local data frame [8 x 4]
# Groups: name [4]

#     name   age birthplace value
#   <fctr> <dbl>     <fctr> <int>
# 1      A    28      city1   100
# 2      A    28      city1   101
# 3      B    NA      city2   102
# 4      B    NA      city2   103
# 5      B    NA      city2   104
# 6      C    NA         NA   105
# 7      D    53         NA   106
# 8      D    53         NA   107
like image 43
Psidom Avatar answered Dec 15 '22 09:12

Psidom