group_by() into fill() not working as expected

Tags: r, dplyr, tidyr

I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.

library(dplyr)
library(tidyr)

df <- data.frame(id=c(1,1,2,2,3,3),
                 email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)

This results in:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 joe@email.com
6     3 joe@email.com

I expect it to be:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 NA
6     3 NA

I expect the latter because the group_by documentation says, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed 'by group'." The group in this case is determined by the id variable, and the following operation is fill(email), so I expect the fill to happen within each id and never carry a value across ids. However, it's pretty clearly NOT doing that.

And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.

UPDATE: @aosmith pointed out this open issue on GitHub. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.

asked Dec 29 '15 19:12

doicomehereoften1

2 Answers

Luckily you can still use zoo::na.locf for this:

df %>% 
    group_by(id) %>% 
    mutate(email = zoo::na.locf(email, na.rm = FALSE))  
# Source: local data frame [6 x 2]
# Groups: id [3]
# 
#      id         email
#   (dbl)        (fctr)
# 1     1 bob@email.com
# 2     1 bob@email.com
# 3     2 joe@email.com
# 4     2 joe@email.com
# 5     3            NA
# 6     3            NA
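
If adding zoo as a dependency is not an option, the same per-group carry-forward can be sketched in base R. This is not part of the original answer: the helper name `locf` and the column name `email_filled` are made up for illustration, and the data frame is rebuilt with `stringsAsFactors = FALSE` so that email is a character column.

```r
# Hypothetical helper (not from zoo or tidyr): last observation carried forward.
locf <- function(x) {
  filled <- !is.na(x)
  # For each element, the index of the most recent non-NA value,
  # or NA if no non-NA value has been seen yet.
  last_seen <- c(NA_integer_, which(filled))[cumsum(filled) + 1]
  x[last_seen]
}

df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c("bob@email.com", NA, "joe@email.com", NA, NA, NA),
                 stringsAsFactors = FALSE)

# ave() applies locf() separately within each id, so a value is never
# carried across a group boundary.
df$email_filled <- ave(df$email, df$id, FUN = locf)
print(df)
```

Inside a dplyr pipeline the same helper should be usable as mutate(email = locf(email)) after group_by(id), in the same place as the zoo::na.locf call above.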

answered Oct 21 '22 01:10

Gregor Thomas

Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.

df %>% group_by(id) %>% fill(email)

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3            NA
6     3            NA
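
To confirm that an installed tidyr includes the fix, a quick check along these lines may help. It is only a sketch: it assumes dplyr and tidyr are installed, uses a character email column rather than a factor, and the name `filled` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Show the installed version; the grouped behaviour is reported above
# for tidyr_0.3.1.9000.
print(packageVersion("tidyr"))

df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c("bob@email.com", NA, "joe@email.com", NA, NA, NA),
                 stringsAsFactors = FALSE)

filled <- df %>%
  group_by(id) %>%
  fill(email) %>%
  ungroup()

# With a group-aware fill(), id 3 has nothing to carry forward and stays NA.
stopifnot(all(is.na(filled$email[filled$id == 3])))
```

If the last line raises an error, the installed tidyr still fills across group boundaries and a workaround such as zoo::na.locf is needed.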

answered Oct 21 '22 01:10

aosmith
