I have a DF of individuals with some incomplete and repeated characteristics as following:
name <- c("A", "A", "B", "B", "B", "C", "D", "D")
age <- c(28,NA,NA,NA,NA,NA,53,NA)
birthplace <- c("city1",NA, "city2",NA,NA,NA,NA,NA)
value <- 100:107
df <- data.frame(name,age,birthplace,value)
name age birthplace value
1 A 28 city1 100
2 A NA <NA> 101
3 B NA city2 102
4 B NA <NA> 103
5 B NA <NA> 104
6 C NA <NA> 105
7 D 53 <NA> 106
8 D NA <NA> 107
Since the value is unique for row. I want complete each row with aviable person's detail like this:
name age birthplace value
1 A 28 city1 100
2 A 28 city1 101
3 B NA city2 102
4 B NA city2 103
5 B NA city2 104
6 C NA <NA> 105
7 D 53 <NA> 106
8 D 53 <NA> 107
I tried to use
library(zoo)
library(dplyr)
df <- df %>% group_by(name) %>% na.locf(na.rm=F)
But it does't work very well. Any idea for implement function by group?
As another base R solution, here is a poor man's na.locf
fill_down <- function(v) {
if (length(v) > 1) {
keep <- c(TRUE, !is.na(v[-1]))
v[keep][cumsum(keep)]
} else v
}
To fill down by group, the approach is to use tapply()
to split and apply to each group, and split<-
to combine groups to the original geometry, as
fill_down_by_group <- function(v, grp) {
## original 'by hand':
## split(v, grp) <- tapply(v, grp, fill_down)
## v
## done by built-in function `ave()`
ave(v, grp, FUN=fill_down)
}
To process multiple columns, one might
elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
I would be interested in seeing how a dplyr solution handles many columns, without hard-coding each? Answering my own question, I guess this is
library(dplyr); library(tidyr)
df %>% group_by(name) %>% fill_(elts)
A more efficient base solution when the groups are already 'grouped' (e.g., identical(grp, sort(grp))
) is
fill_down_by_grouped <- function(v, grp) {
if (length(v) > 1) {
keep <- !(duplicated(v) & is.na(v))
v[keep][cumsum(keep)]
} else v
}
For me, fill_down()
on a vector with about 10M elements takes ~225ms; fill_down_by_grouped()
takes ~300ms independent of the number of groups; fill_down_by_group()
scales with the number of groups; for 10000 groups ~2s, 10M groups about 36s
Could also be:
library(dplyr)
library(tidyr)
df %>% group_by(name) %>% fill(age, birthplace)
# Source: local data frame [8 x 4]
# Groups: name [4]
# name age birthplace value
# <fctr> <dbl> <fctr> <int>
# 1 A 28 city1 100
# 2 A 28 city1 101
# 3 B NA city2 102
# 4 B NA city2 103
# 5 B NA city2 104
# 6 C NA NA 105
# 7 D 53 NA 106
# 8 D 53 NA 107
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With