
odd behavior of do() function in dplyr

Tags: r, dplyr

I am seeing what looks like strange behavior of the do function in dplyr 0.3.0.2, but perhaps I'm misunderstanding something.

I have a data frame that looks like this:

set.seed(668)
stuff <- data.frame(name=c(rep("Frodzak", 5), rep("Dumpf", 4), rep("Ackpth", 6)),
                    state=c("AL", "AK", "AL", "KS", "OR", "LA", "MS", "KY", "FL",
                            "NY", "NY", "NJ", "PA", "NJ", "NY"),
                    important=c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,
                                FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
                    girth=rnorm(15, 250, 80), stringsAsFactors=FALSE)


stuff

      name state important    girth
1  Frodzak    AL     FALSE 148.5870
2  Frodzak    AK     FALSE 321.4144
3  Frodzak    AL      TRUE 224.8380
4  Frodzak    KS     FALSE 315.9416
5  Frodzak    OR     FALSE 331.4336
6    Dumpf    LA      TRUE 317.4794
7    Dumpf    MS     FALSE 170.4174
8    Dumpf    KY     FALSE 275.4033
9    Dumpf    FL     FALSE 240.9276
10  Ackpth    NY      TRUE 145.6290
11  Ackpth    NY     FALSE 267.6902
12  Ackpth    NJ     FALSE 171.4015
13  Ackpth    PA     FALSE 298.5841
14  Ackpth    NJ     FALSE 249.5764
15  Ackpth    NY     FALSE 276.5504

In my application, there will be exactly one TRUE in the "important" column for each group of rows with the same "name". I want to subset the df so as to include only those rows where the state matches the state of the "important" row (within each "name" group). In other words, I want to get

     name state important    girth
1  Ackpth    NY      TRUE 145.6290
2  Ackpth    NY     FALSE 267.6902
3  Ackpth    NY     FALSE 276.5504
4   Dumpf    LA      TRUE 317.4794
5 Frodzak    AL     FALSE 148.5870   
6 Frodzak    AL      TRUE 224.8380 

If I run the following:

importantState <- function(df) {
  impst <- df[df$important, "state"]
  if (length(impst) != 1) stop("group does not have one 'important'")
  impst
}

stuff %>% group_by(name) %>% do(.[.$state == importantState(.), ]) 

In dplyr 0.2 I get exactly what I expect (the above 6-row subset). However, if I run the exact same code using dplyr 0.3.0.2 it returns the entire original df (all 15 rows).

I looked at the 0.3 release notes on GitHub, but I don't see anything that would explain a substantive change in the behavior of do().

Can somebody help me recover at least a little of my sanity by explaining what in heaven's name is going on here? Or any ideas for a creative work-around that I haven't thought of?

Asked by NumerousHats, Oct 31 '22 12:10


1 Answer

Perhaps you could try filter here?

stuff %>%
  group_by(name) %>%
  filter(state == state[important])

#      name state important    girth
# 1 Frodzak    AL     FALSE 148.5870
# 2 Frodzak    AL      TRUE 224.8380
# 3   Dumpf    LA      TRUE 317.4794
# 4  Ackpth    NY      TRUE 145.6290
# 5  Ackpth    NY     FALSE 267.6902
# 6  Ackpth    NY     FALSE 276.5504
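
If you need a work-around that does not depend on the dplyr version at all, the same subset can be sketched in base R. This is only a sketch under the question's stated assumption of exactly one TRUE in "important" per name; the helper names n_important, imp_name, imp_state and target_state are made up for illustration. It extracts the state with `$`, which always yields a plain vector, instead of `df[rows, "state"]`, whose return type (vector or one-column data frame) can depend on the data frame class. Whether that difference is what changed between dplyr 0.2 and 0.3.0.2 is not confirmed here.

```r
# Rebuild the example data (girth omitted; it does not affect the subsetting).
stuff <- data.frame(
  name = c(rep("Frodzak", 5), rep("Dumpf", 4), rep("Ackpth", 6)),
  state = c("AL", "AK", "AL", "KS", "OR", "LA", "MS", "KY", "FL",
            "NY", "NY", "NJ", "PA", "NJ", "NY"),
  important = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
                TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Check the stated precondition: exactly one important row per name.
n_important <- tapply(stuff$important, stuff$name, sum)
stopifnot(all(n_important == 1))

# Name and state of each group's important row, as plain vectors.
imp_name  <- stuff$name[stuff$important]
imp_state <- stuff$state[stuff$important]

# For every row, look up the important state of that row's name group.
target_state <- imp_state[match(stuff$name, imp_name)]

# Keep only the rows whose state equals their group's important state.
subset_rows <- stuff[stuff$state == target_state, ]
subset_rows
```

The rows come back in their original order, like the filter() result above, and the stopifnot() call fails loudly if a group has zero or several important rows.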
Answered by Henrik, Nov 15 '22 05:11