Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way of simultaneously deriving count of unique values and summary values for grouped values in dplyr

I'm interested in finding an efficient manner to obtain a summary by group table that would contain:

  • Count for unique values per group
  • A primitive set of descriptive statistics for selected variables

For example, in case of generating the descriptive statistics I'm using the code below:

data("mtcars")
require(dplyr)
mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max), hp, wt, disp)

which would generate the desired output:

> head(mt_sum)

Source: local data frame [3 x 7]

    cyl hp_min wt_min disp_min hp_max wt_max disp_max
  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl)
1     4     52  1.513     71.1    113  3.190    146.7
2     6    105  2.620    145.0    175  3.460    258.0
3     8    150  3.170    275.8    335  5.424    472.0

I'm interested in enriching the data with figure that would reflect count of values per each group. With respect to the count, this can be simply done:

mt_sum2 <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(countObs = n())

which would generate the required data:

> head(mt_sum2)
Source: local data frame [3 x 2]

    cyl countObs
  (dbl)    (int)
1     4       11
2     6        7
3     8       14 

Problem

The problem occurs when I would like to simultaneously apply both transformations.

Attempt 1

For example the code:

mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max), hp, wt, disp) %>% 
  summarise(countObs = n())

would generate:

Source: local data frame [3 x 2]

    cyl countObs
  (dbl)    (int)
1     4       11
2     6        7
3     8       14

without the descriptive statistics that were previously generated.

Attempt 2

The code:

mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max,n), hp, wt, disp)

will expectedly fail:

Error: n does not take arguments

Attempt 3 (working)

The code:

data("mtcars")
require(dplyr)
mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max), hp, wt, disp) %>% 
  left_join(y = data.frame(
    "Var1" = as.numeric(as.character(as.data.frame(table(mtcars$cyl))$Var1)),
    "Count" = as.character(as.data.frame(table(mtcars$cyl))$Freq)),
            by = c("cyl" = "Var1"))

will deliver the required data:

> head(mt_sum)
Source: local data frame [3 x 8]

    cyl hp_min wt_min disp_min hp_max wt_max disp_max  Count
  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (fctr)
1     4     52  1.513     71.1    113  3.190    146.7     11
2     6    105  2.620    145.0    175  3.460    258.0      7
3     8    150  3.170    275.8    335  5.424    472.0     14

I think this is extremely inefficient way of producing this summary. In particular, creating objects on the fly is inefficient when working with big tables. I'm interested in achieving the same results but in a more efficient manner that would not involve creating objects just for the purpose of merging. In particular, what I would like to do in dplyr would correspond to deriving additional summaries from the previous version of the table. For example:

  1. Group
  2. Produce descriptive statistics
  3. Come back to the data after group
  4. Produce some additional statistics and add to the final data
like image 713
Konrad Avatar asked Oct 19 '22 20:10

Konrad


1 Answers

Here's another (shorter) option using a left_join:

mtcars %>% 
    group_by(cyl) %>%  
    summarise_each(funs(min,max), hp, wt, disp) %>% 
    left_join(count(mtcars, cyl))
#Joining by: "cyl"
#Source: local data frame [3 x 8]
#
#    cyl hp_min wt_min disp_min hp_max wt_max disp_max     n
#  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (int)
#1     4     52  1.513     71.1    113  3.190    146.7    11
#2     6    105  2.620    145.0    175  3.460    258.0     7
#3     8    150  3.170    275.8    335  5.424    472.0    14
like image 156
talat Avatar answered Oct 30 '22 23:10

talat