Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently fill NAs by group

Tags:

performance

r

na

I have a dataset where I observe a variable for some individuals and not for others. For those individuals where I observe the variable, I observe it exactly once. However, the number of observations per individual as well as the position of the observed value varies.

I would like to fill all NA values for a given individual with the non-NA value in case there is a non-NA value. Otherwise, NAs are supposed to remain NA.

Here is an example dataset:

#data.frame of 100 individuals with 10 observations each
data <- data.frame(group = rep(1:100,each=10),value = NA)

#first 50 individuals get a value at the fifth observation, others don't have value
data$value[seq(5,500,10)] <- rnorm(50)

So far so good, not a big issue. Taken from another thread, we could do something like this using dplyr and tidyr:

data <- data %>% 
  group_by(group) %>% #by group
  fill(value) %>% #default direction down
  fill(value, .direction = "up") #also fill NAs upwards

This solves the problem perfectly fine. However, I have to do this for around 80mio. observations, which takes hours. Is there a faster method available? I think data.table might be a good candidate.

It would also be great if it would be possible to adapt the approach to fill only NAs that appear before the value.

Thanks!

like image 461
yrx1702 Avatar asked Jan 09 '19 12:01

yrx1702


3 Answers

You can use a pretty simple approach with both data.table and dplyr which - I believe - will be quite fast and efficient:

in data.table:

library(data.table)
setDT(data)
data[, value := value[!is.na(value)][1L], by = group]

or dplyr:

library(dplyr)
data <- data %>% 
  group_by(group) %>% 
  mutate(value = value[!is.na(value)][1L])

The point is you hava a non-NA value exactly o or 1 times per group. Hence you don'T need the last-observation-carried-forward logic. Just take the first non-NA value (if it exists).

like image 70
talat Avatar answered Oct 06 '22 16:10

talat


We could use data.table to assign in place. Here, na.locf from zoo is used for filling the NA elements with adjacent non-NA element

library(data.table)
library(zoo)
setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE), fromLast = TRUE), group]

Benchmarks

set.seed(24)
data1 <- data.frame(group = rep(1:1e6,each=10),value = NA)
data1$value[seq(5,1e6,10)] <- rnorm(100000)

data2 <- copy(data1)

system.time({setDT(data2)[, value := na.locf(na.locf(value, 
             na.rm = FALSE), fromLast = TRUE), group]})
#   user  system elapsed 
# 70.681   0.294  70.917 


system.time({

data1 %>% 
  group_by(group) %>% #by group
  fill(value) %>% #default direction down
  fill(value, .direction = "up")

})
# 17% ~33 m remaining 

NOTE: It took a lot of time. So have to abort the session.

NOTE2 : This approach is baesd on the assumption that we want to replace the NA elements with the non-NA adjacent elements and have more than one non-NA elements per group

like image 3
akrun Avatar answered Oct 06 '22 15:10

akrun


This is the code I have used: Your code vs akrun vs mine. Sometimes zoo is not the fastest process but it is the cleanest. Anyway, you can test it.

UPDATE: It has been tested with more data (100.000) and Process 03 (subset and merge) wins by far.

Last UPDATE Function comparison with rbenchmark:

library(dplyr)
library(tidyr)
library(base)
library(data.table)
library(zoo)
library(rbenchmark)

#data.frame of 100 individuals with 10 observations each
data <- data.frame(group = rep(1:10000,each=10),value = NA)
data$value[seq(5,5000,10)] <- rnorm(50) #first 50 individuals get a value at the fifth observation, others don't have value

#Process01
P01 <- function (data){
    data01 <- data %>% 
        group_by(group) %>% #by group
            fill(value) %>% #default direction down
            fill(value, .direction = "up") #also fill NAs upwards
    return(data01)
}

#Process02
P02 <- function (data){
    data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE), 
                                             fromLast = TRUE), group]
    return(data02)
}

#Process03
P03 <- function (data){
    dataU <- subset(unique(data), value!='NA') #keep row number
    dataM <- merge(data, dataU, by = "group", all=T) #merge tables
    data03 <- data.frame(group=dataM$group, value = dataM$value.y) #idem shape of data
    return(data03)
}

benchmark("P01_dplyr" = {data01 <- P01(data)},
          "P02_zoo" = {data02 <- P02(data)},
          "P03_data.table" = {data03 <- P03(data)},
          replications = 10,
          columns = c("test", "replications", "elapsed")
          )

Results with data=10.000, 10 reps and I5 7400:

    test replications elapsed
1      P01_dplyr           10  257.78
2        P02_zoo           10   10.35
3 P03_data.table           10    0.09
like image 2
César Arquero Avatar answered Oct 06 '22 15:10

César Arquero