I am seeing what looks like strange behavior of the do function in dplyr 0.3.0.2, but perhaps I'm misunderstanding something.
I have a data frame that looks like this:
set.seed(668)
stuff <- data.frame(name=c(rep("Frodzak", 5), rep("Dumpf", 4), rep("Ackpth", 6)),
                    state=c("AL", "AK", "AL", "KS", "OR", "LA", "MS", "KY", "FL",
                            "NY", "NY", "NJ", "PA", "NJ", "NY"),
                    important=c(F, F, T, F, F, T, F, F, F, T, F, F, F, F, F),
                    girth=rnorm(15, 250, 80), stringsAsFactors=F)
stuff
      name state important    girth
1  Frodzak    AL     FALSE 148.5870
2  Frodzak    AK     FALSE 321.4144
3  Frodzak    AL      TRUE 224.8380
4  Frodzak    KS     FALSE 315.9416
5  Frodzak    OR     FALSE 331.4336
6    Dumpf    LA      TRUE 317.4794
7    Dumpf    MS     FALSE 170.4174
8    Dumpf    KY     FALSE 275.4033
9    Dumpf    FL     FALSE 240.9276
10  Ackpth    NY      TRUE 145.6290
11  Ackpth    NY     FALSE 267.6902
12  Ackpth    NJ     FALSE 171.4015
13  Ackpth    PA     FALSE 298.5841
14  Ackpth    NJ     FALSE 249.5764
15  Ackpth    NY     FALSE 276.5504
In my application, there will be exactly one TRUE in the "important" column for each group of rows with the same "name". I want to subset the df so as to include only those rows where the state matches the state of the "important" row (within each "name" group). In other words, I want to get
     name state important    girth
1  Ackpth    NY      TRUE 145.6290
2  Ackpth    NY     FALSE 267.6902
3  Ackpth    NY     FALSE 276.5504
4   Dumpf    LA      TRUE 317.4794
5 Frodzak    AL     FALSE 148.5870
6 Frodzak    AL      TRUE 224.8380
If I run the following:
importantState <- function(df) {
  impst <- df[df$important, "state"]
  if (length(impst) != 1) stop("group does not have one 'important'")
  impst
}
stuff %>% group_by(name) %>% do(.[.$state == importantState(.), ])
in dplyr 0.2, I get exactly what I expect (the 6-row subset above). However, if I run the exact same code using dplyr 0.3.0.2, it returns the entire original df (all 15 rows).
I looked at the 0.3 release notes on GitHub, but I don't see anything that would relate to a change in the substantive behavior of do.
Can somebody help me recover at least a little of my sanity by explaining what in heaven's name is going on here? Or any ideas for a creative work-around that I haven't thought of?
Perhaps you could try filter here?
stuff %>%
  group_by(name) %>%
  filter(state == state[important])
#      name state important    girth
# 1 Frodzak    AL     FALSE 148.5870
# 2 Frodzak    AL      TRUE 224.8380
# 3   Dumpf    LA      TRUE 317.4794
# 4  Ackpth    NY      TRUE 145.6290
# 5  Ackpth    NY     FALSE 267.6902
# 6  Ackpth    NY     FALSE 276.5504
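If you want to cross-check that result without relying on any particular dplyr version, here is a minimal base R sketch of the same idea. It rebuilds the question's data without the random girth column (so it does not depend on the seed), and the names imp_state and kept are just illustrative. It assumes, as the question states, that each name has exactly one important row, and it stops if that does not hold.

```r
# Same data as in the question, minus the random 'girth' column
stuff <- data.frame(
  name = c(rep("Frodzak", 5), rep("Dumpf", 4), rep("Ackpth", 6)),
  state = c("AL", "AK", "AL", "KS", "OR", "LA", "MS", "KY", "FL",
            "NY", "NY", "NJ", "PA", "NJ", "NY"),
  important = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
                TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Lookup vector: name -> state of that name's "important" row
imp_state <- setNames(stuff$state[stuff$important], stuff$name[stuff$important])

# Guard for the assumption: exactly one important row per name
stopifnot(!anyDuplicated(names(imp_state)),
          all(unique(stuff$name) %in% names(imp_state)))

# Keep rows whose state equals the important state of their own name
kept <- stuff[stuff$state == imp_state[stuff$name], ]
kept$state
# "AL" "AL" "LA" "NY" "NY" "NY"
```

The rows come back in their original order (Frodzak, Dumpf, Ackpth), the same as the filter output, rather than sorted by name as in the question's expected output.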