Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill in missing values by group in data.table

Tags:

r

data.table

If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is

setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]

which is quite complex. It's a shame since roll is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf within each group)

I can write a convenience function to fill in missing values

   fill_na <-  function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE)){
    id <- seq_along(x)
    if (is.null(by)){
      DT <- data.table("x" = x, "id" = id, key = "id") 
      return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])

    } else{
      DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id")) 
      return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
    }
  }

And then write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]

This is not really satisfying since one would like to write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]

However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na should be called with the by option, and should not be used with data.table by. Is there an elegant solution around this?

Some speed test

N <- 2e6
set.seed(1)
DT <- data.table(
         date = sample(10, N, TRUE),
           id = sample(1e5, N, TRUE),   
        value = sample(c(NA,1:5), N, TRUE),
       value2 = sample(c(NA,1:5), N, TRUE)                   
      )
setkey(DT,id,date)
DT<- unique(DT)

system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user  system elapsed 
#>  0.086   0.006   0.105 
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user  system elapsed 
#> 5.235   0.016   5.274 
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#>   user  system elapsed 
#>  0.194   0.019   0.221 
system.time(DT[, filled3 := fill_na(value), by = id])
#>    user  system elapsed 
#> 237.256   0.913 238.405 

Why don't I just use na.locf.default ? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.

like image 794
Matthew Avatar asked Oct 03 '14 01:10

Matthew


People also ask

What is a good way to fill in missing values in a dataset?

Use the fillna() Method: The fillna() function iterates through your dataset and fills all null rows with a specified value. It accepts some optional arguments—take note of the following ones: Value: This is the value you want to insert into the missing rows. Method: Lets you fill missing values forward or in reverse.

How do you show missing data in a table?

Right-click Variable with missing values in the table preview on the canvas pane and select Categories and Totals from the pop-up menu. Click (check) Missing Values in the Categories and Totals dialog box, and then click Apply. Now the table preview includes a Missing Values category.

What is .n in data table?

Think of .N as a variable for the number of instances. For example: dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7) dt[.N] # returns the last row # a b # 1: C 7.

How do you omit missing values?

First, if we want to exclude missing values from mathematical operations use the na. rm = TRUE argument. If you do not exclude these values most functions will return an NA . We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data.

How I can fill the columns with missing pieces of information?

How I can fill the columns with missing pieces of information (article number, article name) based on the Source Data, previous ranking period Same columns in both tables Same columns in both tables Same columns in both tables Missing info: Article-nr and Article - same as on photo 1 same values in other columnes between those two tables.

How do you fill missing data in Excel with mean?

Filling missing data with a mean or median value is applicable when the columns involved have integer or float data types. You can also fill missing data with the mode value, which is the most occurring value. This is also applicable to integers or floats.

How to fill missing data using PANDAS?

You can fix missing data by either dropping or filling them with other values. In this article, we'll explain and explore the different ways to fill missing data using pandas. 1. Use the fillna () Method:

When do we need to replace data with missing values?

Instead of filling missing values, we sometimes need to replace the data with missing values. This might be required in situations when missing values are coded with a number or the actual values are not useful or sensible for the data study.


2 Answers

Here's a slightly faster and more compact way of doing it (version 1.9.3+):

DT[, filled4 := DT[!is.na(value)][DT, value, roll = T]]
like image 183
eddi Avatar answered Sep 19 '22 04:09

eddi


There is now a native data.table way of filling missing values (as of 1.12.4).

This question spawned a github issue which was recently closed with the creation of functions nafill and setnafill. You can now use

DT[, value_filled_in := nafill(value, type = "locf")]

It is also possible to fill NA with a constant value or next observation carried back.

One difference to the approach in the question is that these functions currently only work on NA not NaN whereas is.na is TRUE for NaN - this is planned to be fixed in the next release through an extra argument.

I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.

Update: By default NaN is now treated same as NA.

like image 21
anotherfred Avatar answered Sep 19 '22 04:09

anotherfred