Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

group_by() into fill() not working as expected

Tags:

r

dplyr

tidyr

I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.

library(dplyr)
library(tidyr)

df <- data.frame(id=c(1,1,2,2,3,3),
                 email=c('[email protected]', NA, '[email protected]', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)

This results in:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 [email protected]
2     1 [email protected]
3     2 [email protected]
4     2 [email protected]
5     3 [email protected]
6     3 [email protected]

I expect it to be:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 [email protected]
2     1 [email protected]
3     2 [email protected]
4     2 [email protected]
5     3 NA
6     3 NA

The reason I expect it to be the latter is because of group_by's documentation saying, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.


And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.


UPDATE @aosmith pointed out this open issue on Github. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.

like image 888
doicomehereoften1 Avatar asked Dec 29 '15 19:12

doicomehereoften1


People also ask

What's the point of using Group_by ()?

Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group".

What is fill in R?

Source: R/fill.R. fill.Rd. Fills missing values in selected columns using the next or previous entry. This is useful in the common output format where values are not repeated, and are only recorded when they change.

How does Dplyr Group_by work?

Groupby Function in R – group_by is used to group the dataframe in R. Dplyr package in R is provided with group_by() function which groups the dataframe by multiple columns with mean, sum and other functions like count, maximum and minimum.

Is there a Groupby function in R?

Group_by() function belongs to the dplyr package in the R programming language, which groups the data frames. Group_by() function alone will not give any output. It should be followed by summarise() function with an appropriate action to perform. It works similar to GROUP BY in SQL and pivot table in excel.


2 Answers

Luckily you can still use zoo::na.locf for this:

df %>% 
    group_by(id) %>% 
    mutate(email = zoo::na.locf(email, na.rm = FALSE))  
# Source: local data frame [6 x 2]
# Groups: id [3]
# 
#      id         email
#   (dbl)        (fctr)
# 1     1 [email protected]
# 2     1 [email protected]
# 3     2 [email protected]
# 4     2 [email protected]
# 5     3            NA
# 6     3            NA
like image 189
Gregor Thomas Avatar answered Oct 21 '22 01:10

Gregor Thomas


Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.

df %>% group_by(id) %>% fill(email)

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 [email protected]
2     1 [email protected]
3     2 [email protected]
4     2 [email protected]
5     3            NA
6     3            NA
like image 26
aosmith Avatar answered Oct 21 '22 01:10

aosmith