
dplyr: Arrange not behaving as expected after group_by and summarize

Tags: r, dplyr

I must be missing something about how group_by levels in dplyr get peeled off. In the example below, I group by two columns, summarize values into a single variable, then sort by that new variable:

mtcars %>% group_by( cyl, gear ) %>% 
  summarize( hp_range = max(hp) - min(mpg)) %>% 
  arrange( desc(hp_range) )

# Source: local data frame [8 x 3]
# Groups: cyl [3]
#
#    cyl  gear  hp_range
#  (dbl) (dbl) (dbl)
#1     4     4  87.6
#2     4     5  87.0
#3     4     3  75.5
#4     6     5 155.3
#5     6     4 105.2
#6     6     3  91.9
#7     8     5 320.0
#8     8     3 234.6

Obviously this is not sorted by hp_range as intended. What am I missing?

EDIT: The example works as expected without the call to desc in arrange. It is still unclear to me why.

zimmeee asked Sep 07 '15 22:09


1 Answer

Ok, just got to the bottom of this:

  1. The call to desc was not the cause. It was only by chance that the example looked right without it: the hp_range values of the three cyl groups do not overlap, so an ascending sort within each group also happens to be ascending overall.
  2. The key is that summarize only peels off the last grouping level (gear), so the result is still grouped by cyl, and arrange then sorts within each cyl group rather than across the whole table. To get the intended sort of the entire data table, you must first ungroup and then arrange:

    mtcars %>% group_by( cyl, gear ) %>%
       summarize( hp_range = max(hp) - min(mpg)) %>%
       ungroup() %>%          # drop the remaining cyl grouping
       arrange( hp_range )    # sorts the whole table, ascending
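
For the descending order the question actually asked for, here is a minimal sketch, assuming dplyr is installed. The names `ranges` and `sorted` are made up for illustration. It first checks which grouping is left after summarise, then ungroups and sorts with desc. As far as I know, newer dplyr releases make arrange ignore grouping unless you pass `.by_group = TRUE`, so the original symptom may not reproduce there, but ungrouping first is harmless either way.

```r
library(dplyr)

# summarise() peels off only the last grouping level (gear),
# so the result is still grouped by cyl.
ranges <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(hp_range = max(hp) - min(mpg))

# Inspect the grouping that is left over after summarise().
print(group_vars(ranges))

# Drop the remaining grouping, then sort the whole table in descending order.
sorted <- ranges %>%
  ungroup() %>%
  arrange(desc(hp_range))

print(sorted)
```

If `group_vars(ranges)` comes back non-empty, any later verb that respects grouping will operate per group, which is what caused the confusing output in the question.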
    
zimmeee answered Sep 21 '22 22:09