I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:
houseID year price
1 1995 NA
1 1996 100
1 1997 NA
1 1998 120
1 1999 NA
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 NA
3 1995 NA
3 1996 44
3 1997 NA
3 1998 NA
3 1999 NA
I would like to make a data frame like this:
houseID year price
1 1995 NA
1 1996 100
1 1997 100
1 1998 120
1 1999 120
2 1995 NA
2 1996 NA
2 1997 NA
2 1998 30
2 1999 30
3 1995 NA
3 1996 44
3 1997 44
3 1998 44
3 1999 44
Here are some data in the right format:
# Number of houses
N = 15
# Data frame
df = data.frame(houseID = rep(1:N,each=10), year=1995:2004, price =ifelse(runif(10*N)>0.15, NA,exp(rnorm(10*N))))
Is there a dplyr-way to do that?
Use the na.locf() function from the zoo package to carry the last observation forward and replace your NA values.
The easiest way to replace NAs with the mean in multiple columns is to use mutate_at() together with vars(), which let you select the columns in which to replace the missing values. To do the actual replacement, combine replace_na() with mean(), passing na.rm = TRUE so that the mean itself is not NA.
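The mean replacement described above might look like the following minimal sketch. The data frame `scores` and its columns are hypothetical, and the columns are doubles so that the mean can be stored without a type conflict. mutate_at() is superseded by across() in recent dplyr releases but is still available.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: an id column plus two numeric columns with gaps
scores <- data.frame(id = 1:3, a = c(1, NA, 3), b = c(NA, 4, 8))

# Replace each NA in columns a and b with that column's mean
filled <- scores %>%
  mutate_at(vars(a, b), ~ replace_na(.x, mean(.x, na.rm = TRUE)))

filled
```

Note that this fills gaps with a column-wide average, which is a different operation from carrying the last observation forward.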
So, how do you replace missing values with basic R code? First identify the NAs with the is.na() function and the $ operator. Then, for example, use the min() function (with na.rm = TRUE) to replace the NAs with the lowest observed value.
The classic way to replace NAs in R is by using the is.na() function (R is case-sensitive, so the name must be lowercase). is.na() takes a vector or data frame as input and returns a logical object that indicates whether each value is missing (TRUE or FALSE). You can then use this logical object to select the missing values and assign them a zero.
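As a small base R illustration of the is.na() approach, assuming a hypothetical numeric vector `prices`:

```r
# Hypothetical price vector with gaps
prices <- c(NA, 100, NA, 120, NA)

# Logical vector marking the missing positions
missing <- is.na(prices)

# Work on a copy and overwrite only the missing positions with zero
prices_zero <- prices
prices_zero[missing] <- 0

prices_zero
```

The same indexing pattern works on a data frame column, for example `df$price[is.na(df$price)] <- 0`.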
In R, the easiest way to replace NAs with the last non-missing value is by using the fill() function from the tidyr package. This function detects missing values in a data frame and substitutes the last non-missing value (per group). Alternatively, you can use the na.locf() function from zoo or the setnafill() function from data.table.
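For the data.table route, a grouped fill might be sketched with nafill(), the non-reference sibling of setnafill(). This assumes data.table 1.12.4 or later and a numeric price column; the table name `houses` is hypothetical and mirrors the question's example.

```r
library(data.table)

# Hypothetical copy of the question's example data
houses <- data.table(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA, NA, 44, NA, NA, NA)
)

# Rows must be in time order within each house before filling
setorder(houses, houseID, year)

# Carry the last observed price forward, separately for each house
houses[, price := nafill(price, type = "locf"), by = houseID]

houses
```

Leading NAs within a house stay NA because there is no earlier observation to carry forward.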
The zoo R package contains the na.locf function, which is a generic function for replacing each NA with the most recent non-NA value prior to it. Let’s do this in practice:
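A minimal sketch with a hypothetical vector `price`. By default na.locf() drops leading NAs that cannot be filled, so na.rm = FALSE is needed to keep the vector at its original length:

```r
library(zoo)

# Hypothetical price series for one house, 1995 to 1999
price <- c(NA, 100, NA, 120, NA)

# Default (na.rm = TRUE): the leading NA is dropped, result has length 4
short <- na.locf(price)

# na.rm = FALSE keeps the leading NA, result has length 5
full <- na.locf(price, na.rm = FALSE)

short
full
```

Keeping the length unchanged matters when the result is assigned back into a data frame column.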
The problem: users often want to replace missing values by neighboring nonmissing values, particularly when observations occur in some definite order, often (but not always) a time order.
If missing values occur singly, they can be replaced by the previous value. Here the subscript notation is that _n always refers to any given observation, _n−1 to the previous observation and _n+1 to the following observation, given the current sort order.
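The previous-value rule can be sketched in base R without any packages. The helper `locf` below is a hypothetical name, not a function from a library; it also handles runs of consecutive NAs, and it assumes the rows are already sorted by year within each house.

```r
# Hypothetical helper: carry the last non-missing value forward
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-missing values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # position 1 is NA, for rows before any value
}

# Hypothetical copy of the question's example data
houses <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA, NA, 44, NA, NA, NA)
)

# Apply the helper within each house
houses$price <- ave(houses$price, houses$houseID, FUN = locf)

houses
```

The cumsum() trick maps every row to the index of the most recent observation, so no explicit loop over rows is needed.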
tidyr::fill now makes this stupidly easy:
library(dplyr)
library(tidyr)  # or library(tidyverse)
df %>% group_by(houseID) %>% fill(price)
# Source: local data frame [15 x 3]
# Groups: houseID [3]
#
#    houseID  year price
#      (int) (int) (int)
# 1        1  1995    NA
# 2        1  1996   100
# 3        1  1997   100
# 4        1  1998   120
# 5        1  1999   120
# 6        2  1995    NA
# 7        2  1996    NA
# 8        2  1997    NA
# 9        2  1998    30
# 10       2  1999    30
# 11       3  1995    NA
# 12       3  1996    44
# 13       3  1997    44
# 14       3  1998    44
# 15       3  1999    44
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
dplyr
library(dplyr)
library(zoo)
# na.locf with leading NAs kept, so the number of rows is unchanged
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo
This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", "price"), class = "data.frame", row.names = c(NA, -15L))
REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
You can do a rolling self-join, supported by data.table:
require(data.table)
setDT(df) ## change it to data.table in place
setkey(df, houseID, year) ## needed for fast join
df.woNA <- df[!is.na(price)] ## version without the NA rows
# rolling self-join will return what you want
df.woNA[df, roll=TRUE] ## will match previous year if year not found