Replace missing values (NA) with most recent non-NA by group

Tags:

r

dplyr

I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    NA
1            1998    120
1            1999    NA
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    NA
3            1995    NA
3            1996    44
3            1997    NA
3            1998    NA
3            1999    NA

I would like to make a data frame like this:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    100
1            1998    120
1            1999    120
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    30
3            1995    NA
3            1996    44
3            1997    44
3            1998    44
3            1999    44

Here are some data in the right format:

# Number of houses
N = 15

# Data frame
df = data.frame(houseID = rep(1:N,each=10), year=1995:2004, price =ifelse(runif(10*N)>0.15, NA,exp(rnorm(10*N))))

Is there a dplyr-way to do that?

asked Apr 28 '14 11:04

Peter Stephensen

3 Answers

tidyr::fill now makes this stupidly easy:

library(dplyr)
library(tidyr)
# or library(tidyverse)

df %>% group_by(houseID) %>% fill(price)
# Source: local data frame [15 x 3]
# Groups: houseID [3]
#
#    houseID  year price
#      (int) (int) (int)
# 1        1  1995    NA
# 2        1  1996   100
# 3        1  1997   100
# 4        1  1998   120
# 5        1  1999   120
# 6        2  1995    NA
# 7        2  1996    NA
# 8        2  1997    NA
# 9        2  1998    30
# 10       2  1999    30
# 11       3  1995    NA
# 12       3  1996    44
# 13       3  1997    44
# 14       3  1998    44
# 15       3  1999    44
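For readers who want to run this end to end, here is a minimal self-contained sketch. It rebuilds the 15-row example from the question, spells out the default fill direction, and drops the grouping afterwards. The name filled is only an illustration, and the prices are stored as doubles here rather than integers.

```r
library(dplyr)
library(tidyr)

# Same 15-row example as in the question (prices as doubles)
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

filled <- df %>%
  group_by(houseID) %>%                   # fill within each house only
  fill(price, .direction = "down") %>%    # "down" is the default direction
  ungroup()

filled
```

Because the data are grouped, a price is never carried over from one house to the next, so the leading NAs of each house stay NA. fill() also accepts .direction = "up", "downup" and "updown" if values should be carried backwards as well.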

answered Sep 29 '22 12:09

alistaire

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.
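The difference between the two defaults is easiest to see on a single vector. The sketch below assumes a version of zoo recent enough to provide na.locf0; the vector x is just the price column of the first house.

```r
library(zoo)

x <- c(NA, 100, NA, 120, NA)

a <- na.locf(x)                  # default na.rm = TRUE: the leading NA is dropped
b <- na.locf(x, na.rm = FALSE)   # leading NA kept, result has the same length as x
c0 <- na.locf0(x)                # for a plain vector, same result as the line above

length(a)   # one element shorter than x
b
c0
```

The length matters inside mutate() and ave(), which expect a result as long as the input. That is why the solutions below use na.locf0 or the na.locf2 wrapper rather than na.locf with its defaults.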

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
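If adding zoo as a dependency is not an option, the same idea can be sketched in base R. The helper locf below is hypothetical (it is not part of any package) and is only meant to show what carrying the last observation forward amounts to: for each position, look up the index of the most recent non-NA value.

```r
# Hypothetical helper, not from zoo or any other package
locf <- function(x) {
  seen <- !is.na(x)
  # index of the most recent non-NA value; NA while none has been seen yet
  last <- c(NA, which(seen))[cumsum(seen) + 1]
  x[last]
}

df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

# ave() applies locf separately within each houseID
df$price <- ave(df$price, df$houseID, FUN = locf)
df
```

The helper returns a vector of the same length as its input, so it can stand in for na.locf0 in the ave and mutate solutions.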

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(
  houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  year = c(1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L,
    1999L, 1995L, 1996L, 1997L, 1998L, 1999L),
  price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L, NA, NA, NA)),
  .Names = c("houseID", "year", "price"),
  class = "data.frame", row.names = c(NA, -15L))

REVISED Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

answered Sep 29 '22 12:09

G. Grothendieck

You can do a rolling self-join, supported by data.table:

require(data.table)
setDT(df)   ## change it to data.table in place
setkey(df, houseID, year)     ## needed for fast join
df.woNA <- df[!is.na(price)]  ## version without the NA rows

# rolling self-join will return what you want
df.woNA[df, roll=TRUE]  ## will match previous year if year not found
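The same join can be written without setting a key by naming the join columns with on=. This is a sketch on the 15-row example from the question; the names dt, obs and filled are only illustrative. The last column listed in on= is the one that rolls, so houseID must match exactly while year falls back to the most recent earlier year.

```r
library(data.table)

dt <- data.table(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,
              NA, NA, NA, 30, NA,
              NA, 44, NA, NA, NA)
)

obs <- dt[!is.na(price)]   # only the rows that actually carry a price

# For every row of dt, take the latest obs row with the same houseID
# and a year at or before that row's year.
filled <- obs[dt, on = .(houseID, year), roll = TRUE]

# price now holds the rolled value from obs; the original column from dt
# comes back as i.price and can be dropped
filled[, i.price := NULL]
filled
```

Rows that have no earlier price for their house find no match and keep NA, which is the behaviour asked for in the question.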

answered Sep 29 '22 10:09

ilir
