How do I use tidyr to fill in completed rows within each value of a grouping variable?

Tags: r, tidyr

Say I have data on people who choose between several options. I have one row per person, and I want to have one row per person and choice option. So, if I have 10 people who have 3 choices, right now I have 10 rows, and I want to have 30.

All of the other variables should be copied to each of the new rows. So, for example, if I have a variable for gender, that should be constant within ID. (I am setting my data up this way to analyze with mnlogit.)

This seems like the situation that two tidyr functions, complete and fill, were designed for. To use a simple example:

library(lubridate)
library(tidyr)
dat <- data.frame(
    id = 1:3,
    choice = 5:7,
    c = c(9, NA, 11),
    d = ymd(NA, "2015-09-30", "2015-09-29")
    )

dat %>% 
  complete(id, choice) %>%
  fill(everything())

# Source: local data frame [9 x 4]
# 
#      id choice     c          d
#   (int)  (int) (dbl)     (time)
# 1     1      5     9       <NA>
# 2     1      6     9       <NA>
# 3     1      7     9       <NA>
# 4     2      5     9       <NA>
# 5     2      6     9 2015-09-30
# 6     2      7     9 2015-09-30
# 7     3      5     9 2015-09-30
# 8     3      6     9 2015-09-30
# 9     3      7    11 2015-09-29

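But this has a problem: fill() carries values down the whole column without regard to id. The value of c from ID 1 replaced the (correct) NA values for ID 2, and d was only filled downward, so the first row for ID 2 is still NA and the first two rows for ID 3 carry the date from ID 2.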
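
One possible way to keep fill() inside each ID is to group by id before filling. This is only a sketch and not part of the original post: it assumes dplyr is installed and tidyr is at least 1.0.0 (for .direction = "downup"), and it uses as.Date() in place of lubridate::ymd() to avoid the extra dependency.

```r
library(dplyr)
library(tidyr)

# Same toy data as above; as.Date() stands in for lubridate::ymd().
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

filled <- dat %>%
  complete(id, choice) %>%             # 3 ids x 3 choices = 9 rows
  group_by(id) %>%                     # confine fill() to each id
  fill(c, d, .direction = "downup") %>% # fill down, then up, within id
  ungroup()

filled
```

Because each id has exactly one observed row, filling in both directions copies that row's values to the new rows, and a value that was NA in the observed row stays NA for the whole id.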

I could try a workaround, like replacing all of the missing values with 999, running complete and fill, and then replacing 999 with NA. (I think I would have to convert the date variables to character variables and then convert them back again if I go this route.) But maybe someone on here knows of a tidy way to do this with tidyr?

Edit: the desired output here is:

# Source: local data frame [9 x 4]
# 
#     id     c          d choice
#  (int) (dbl)     (time)  (int)
# 1     1     9       <NA>      5
# 2     1     9       <NA>      6
# 3     1     9       <NA>      7
# 4     2    NA 2015-09-30      5
# 5     2    NA 2015-09-30      6
# 6     2    NA 2015-09-30      7
# 7     3    11 2015-09-29      5
# 8     3    11 2015-09-29      6
# 9     3    11 2015-09-29      7

Jake Fisher asked Sep 30 '15 19:09


1 Answer

As an update to @jeremycg's answer: from tidyr 0.5.1 (or maybe even version 0.4.0) onwards, c() no longer works inside complete(). Use nesting() instead:

dat %>% 
 complete(nesting(id, c, d), choice) 
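
For reference, here is a self-contained sketch of the same call on the data from the question. It assumes a reasonably recent tidyr, calls complete() directly rather than through the pipe, and uses as.Date() instead of lubridate::ymd() so that only tidyr needs to be loaded.

```r
library(tidyr)

# Toy data from the question; as.Date() stands in for lubridate::ymd().
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# nesting(id, c, d) keeps only the (id, c, d) combinations that actually
# occur in dat; complete() then crosses them with every value of choice.
out <- complete(dat, nesting(id, c, d), choice)

out
```

Since c and d travel together with id inside nesting(), no fill() step is needed, and the NA in c for ID 2 is preserved.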

Note: I tried to edit @jeremycg's answer, since it was correct at the time it was written (and hence a new answer is not really necessary), but unfortunately the edit was rejected.

Manuel R answered Sep 20 '22 13:09