How to do the operation
df1 %>% spread(groupid, value, fill = 0) %>% gather(groupid, value, one, two)
in a more natural way?
Given a data frame
df1 <- data.frame(groupid = c("one","one","one","two","two","two", "one"),
value = c(3,2,1,2,3,1,22),
itemid = c(1:6, 6))
for many itemid and groupid pairs we have a value, for some itemids there are groupids where there is no value. I want to add a default value for those cases. E.g. for the itemid 1 and groupid "two" there is no value, I want to add a row where this gets a default value.
The following tidyr code achieves this, but it feels like a strange way to do it (the default value added here is 0).
df1 %>% spread(groupid, value, fill = 0) %>% gather(groupid, value, one, two)
I am looking for suggestions on how to do this in a more natural way.
Since in some weeks looking at the above code I would likely be confused about its effect I wrote a function wrapping it:
#' Add default values for missing groups
#'
#' Given data about items where each item is identified by an id, and every
#' item can have a value in every group; add a default value for all groups
#' where an item doesn't have a value yet.
add_default_value <- function(data, id, group, value, default) {
id = as.character(substitute(id))
group = as.character(substitute(group))
value = as.character(substitute(value))
groups <- unique(as.character(data[[group]]))
# spread checks that the columns outside of group and value uniquely
# determine the row. Here we check that that already is the case within
# each group using only id. I.e. there is no repeated (id, group).
id_group_cts <- data %>% group_by_(id, group) %>% do(data.frame(.ct = nrow(.)))
if (any(id_group_cts$.ct > 1)) {
badline <- id_group_cts %>% filter(.ct > 1) %>% top_n(1, .ct)
stop("There is at least one (", id, ", ", group, ")",
" combination with two members: (",
as.character(badline[[id]]), ", ", as.character(badline[[group]]), ")")
}
gather_(spread_(data, group, value, fill = default), group, value, groups)
}
Last note: reason for wanting this is, my groups are ordered (week1, week2, ...) and I am looking to have every id have a value in every group so that after sorting the groups per id I can use cumsum to get a weekly running total that is also shown in the weeks where the running total didn't increase.
One possibility is to use expand from tidyr. This approach is very similar to the expand.grid idea of @akrun (it actually uses expand.grid internally). I used the dplyr package for the housekeeping after joining the expanded data with the original data.
This approach is longer than the spread/gather approach. Personally I find it a bit more clear what is going on. In my (rather small) benchmark test, spread/gather performed slightly better than expand/join.
# expand first
expand(df1, itemid, groupid) %>%
# then join back to data
left_join(df1, by = c("itemid", "groupid")) %>%
# because there is no fill argument in join
mutate(value = ifelse(is.na(value), 0, value)) %>%
# rearange
arrange(groupid, itemid)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With