Fill in missing values by group in data.table

Q: How do you omit missing values?

First, if we want to exclude missing values from mathematical operations use the na. rm = TRUE argument. If you do not exclude these values most functions will return an NA . We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data.

Q: How I can fill the columns with missing pieces of information?

How I can fill the columns with missing pieces of information (article number, article name) based on the Source Data, previous ranking period Same columns in both tables Same columns in both tables Same columns in both tables Missing info: Article-nr and Article - same as on photo 1 same values in other columnes between those two tables.

Q: How do you fill missing data in Excel with mean?

Filling missing data with a mean or median value is applicable when the columns involved have integer or float data types. You can also fill missing data with the mode value, which is the most occurring value. This is also applicable to integers or floats.

Q: How to fill missing data using PANDAS?

You can fix missing data by either dropping or filling them with other values. In this article, we'll explain and explore the different ways to fill missing data using pandas. 1. Use the fillna () Method:

Q: When do we need to replace data with missing values?

Instead of filling missing values, we sometimes need to replace the data with missing values. This might be required in situations when missing values are coded with a number or the actual values are not useful or sensible for the data study.

Tags:

r

data.table

If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is

setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]

which is quite complex. It's a shame since roll is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf within each group)

I can write a convenience function to fill in missing values

   fill_na <-  function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE)){
    id <- seq_along(x)
    if (is.null(by)){
      DT <- data.table("x" = x, "id" = id, key = "id") 
      return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])

    } else{
      DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id")) 
      return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
    }
  }

And then write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]

This is not really satisfying since one would like to write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]

However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na should be called with the by option, and should not be used with data.table by. Is there an elegant solution around this?

Some speed test

N <- 2e6
set.seed(1)
DT <- data.table(
         date = sample(10, N, TRUE),
           id = sample(1e5, N, TRUE),   
        value = sample(c(NA,1:5), N, TRUE),
       value2 = sample(c(NA,1:5), N, TRUE)                   
      )
setkey(DT,id,date)
DT<- unique(DT)

system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user  system elapsed 
#>  0.086   0.006   0.105 
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user  system elapsed 
#> 5.235   0.016   5.274 
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#>   user  system elapsed 
#>  0.194   0.019   0.221 
system.time(DT[, filled3 := fill_na(value), by = id])
#>    user  system elapsed 
#> 237.256   0.913 238.405

Why don't I just use na.locf.default ? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.

794

asked Oct 03 '14 01:10

Matthew

2 Answers

Here's a slightly faster and more compact way of doing it (version 1.9.3+):

DT[, filled4 := DT[!is.na(value)][DT, value, roll = T]]

183

answered Sep 19 '22 04:09

eddi

There is now a native data.table way of filling missing values (as of 1.12.4).

This question spawned a github issue which was recently closed with the creation of functions nafill and setnafill. You can now use

DT[, value_filled_in := nafill(value, type = "locf")]

It is also possible to fill NA with a constant value or next observation carried back.

One difference to the approach in the question is that these functions currently only work on NA not NaN whereas is.na is TRUE for NaN - this is planned to be fixed in the next release through an extra argument.

I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.

Update: By default NaN is now treated same as NA.

answered Sep 19 '22 04:09

anotherfred

Related questions
                            
                                Using setDT inside a function
                            
                                Using R to connect to a sharepoint list
                            
                                How to shade part of a density curve in ggplot (with no y axis data)
                            
                                Using ipython magics in R jupyter notebook?
                            
                                set missing values for multiple labelled variables
                            
                                Programming with dplyr using string as input
                            
                                Warning: unable to access index for repository https://www.stats.ox.ac.uk/pub/RWin/src/contrib: [duplicate]
                            
                                How to use Cairo PNGs in R Markdown
                            
                                Discounted Cumulative Sum in R
                            
                                Reshape data frame to convert factors into columns in R
                            
                                trying to remove all margins so that plot region comprises the entire graphic
                            
                                Using Rcpp within parallel code via snow to make a cluster
                            
                                R Plot Specify number of time tickmarks - time/date equivalent to pretty
                            
                                Why do rapply and lapply handle NULL differently?
                            
                                Rscript vs. source: What are the key differences?
                            
                                Why apply() returns a transposed xts matrix?
                            
                                ggplot2 boxplot with labelled rug
                            
                                What's the best way to replace missing values with NA when reading in a .csv?
                            
                                Make a rectangular legend, with rows and columns labeled, in grid
                            
                                grouped operations that result in length not equal to 1 or length of group in dplyr

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With