
Replace missing values (NA) with most recent non-NA by group

Tags:

r

dplyr

I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    NA
1            1998    120
1            1999    NA
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    NA
3            1995    NA
3            1996    44
3            1997    NA
3            1998    NA
3            1999    NA

I would like to make a data frame like this:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    100
1            1998    120
1            1999    120
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    30
3            1995    NA
3            1996    44
3            1997    44
3            1998    44
3            1999    44

Here are some data in the right format:

# Number of houses
N = 15

# Data frame
df = data.frame(houseID = rep(1:N,each=10), year=1995:2004, price =ifelse(runif(10*N)>0.15, NA,exp(rnorm(10*N))))

Is there a dplyr-way to do that?

Peter Stephensen asked Apr 28 '14 11:04


3 Answers

tidyr::fill now makes this stupidly easy:

library(dplyr)
library(tidyr)
# or library(tidyverse)

df %>% group_by(houseID) %>% fill(price)
# Source: local data frame [15 x 3]
# Groups: houseID [3]
#
#    houseID  year price
#      (int) (int) (int)
# 1        1  1995    NA
# 2        1  1996   100
# 3        1  1997   100
# 4        1  1998   120
# 5        1  1999   120
# 6        2  1995    NA
# 7        2  1996    NA
# 8        2  1997    NA
# 9        2  1998    30
# 10       2  1999    30
# 11       3  1995    NA
# 12       3  1996    44
# 13       3  1997    44
# 14       3  1998    44
# 15       3  1999    44
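
Because fill() works in row order, it can help to sort by year inside each house first. The following is a minimal sketch under that assumption, rebuilding the question's sample data by hand; the name `filled` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (3 houses, 5 years each)
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

filled <- df %>%
  arrange(houseID, year) %>%             # fill() follows row order, so sort first
  group_by(houseID) %>%                  # never carry a price across houses
  fill(price, .direction = "down") %>%   # "down" is the default: last value forward
  ungroup()

filled$price
```

Leading NAs in each house stay NA, since there is no earlier price to carry forward.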

alistaire answered Sep 29 '22 12:09


These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.
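
To see why na.rm = FALSE matters here, the sketch below compares the variants on one house's prices. The variable names are illustrative only.

```r
library(zoo)

x <- c(NA, 100, NA, 120, NA)

dropped <- na.locf(x)                 # default na.rm = TRUE drops the leading NA
kept    <- na.locf(x, na.rm = FALSE)  # keeps the leading NA, same length as x
kept0   <- na.locf0(x)                # same as `kept` for a plain vector

length(dropped)  # shorter than x
length(kept)     # same length as x
```

Inside mutate() or ave() the result must have the same length as the group, which is why the solutions use na.locf0 or the na.locf2 wrapper rather than plain na.locf.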

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup 

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows 

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
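
If zoo is not available, the same ave pattern should also work with a small hand-written helper. This is a sketch of a different technique, not part of the answer above: `locf_base` is a hypothetical name and is not defined in any package.

```r
# Hypothetical helper (not from zoo): last observation carried forward
locf_base <- function(x) {
  seen <- cumsum(!is.na(x))        # how many non-NA values have been seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # slot 1 is NA for "nothing seen yet"
}

# Sample data from the question (3 houses, 5 years each)
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

# ave() applies the helper within each houseID and keeps the row order
res <- transform(df, price = ave(price, houseID, FUN = locf_base))
res$price
```

The helper indexes into the vector of observed values, so each row picks up the most recent observed price and rows before the first observation stay NA.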

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]
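
Recent versions of data.table also provide nafill(), which can do the same without zoo. This is a sketch assuming a data.table release that includes nafill(); the object name `dt` is illustrative.

```r
library(data.table)

# Sample data from the question (3 houses, 5 years each)
dt <- data.table(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

setorder(dt, houseID, year)                                  # make sure rows are in time order
dt[, price := nafill(price, type = "locf"), by = houseID]   # fill within each house, by reference

dt$price
```

Note that `:=` modifies `dt` in place, so copy the table first if the original prices are still needed.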

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(
  houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  year = c(1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L, 1999L,
           1995L, 1996L, 1997L, 1998L, 1999L),
  price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L, NA, NA, NA)),
  .Names = c("houseID", "year", "price"),
  class = "data.frame", row.names = c(NA, -15L))

REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

G. Grothendieck answered Sep 29 '22 12:09


You can do a rolling self-join, supported by data.table:

require(data.table)
setDT(df)   ## change it to data.table in place
setkey(df, houseID, year)     ## needed for fast join
df.woNA <- df[!is.na(price)]  ## version without the NA rows

# rolling self-join will return what you want
df.woNA[df, roll=TRUE]  ## will match previous year if year not found
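
For a self-contained check of the rolling join, the sketch below rebuilds the question's sample data and spells out the join columns with `on =` instead of relying on the key. The names `dt`, `known` and `rolled` are illustrative.

```r
library(data.table)

# Sample data from the question (3 houses, 5 years each)
dt <- data.table(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

known <- dt[!is.na(price)]   # only the rows with an observed price

# For every row of dt: match houseID exactly, then take the latest
# known year at or before that row's year (roll applies to the last column)
rolled <- known[dt, on = .(houseID, year), roll = TRUE]

rolled$price   # filled prices; the original column comes back as i.price
```

Rows with no earlier observation in the same house have nothing to roll from, so their price stays NA.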

ilir answered Sep 29 '22 10:09