Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to bin the summarised frequency table with dplyr

Tags:

r

dplyr

tidyverse

I have the following data frame:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- nycflights13::flights %>% 
  select(distance) %>% 
  group_by(distance) %>% 
  summarise(n = n()) %>% 
  arrange(distance) %>% ungroup() 

df
#> # A tibble: 214 x 2
#>    distance     n
#>       <dbl> <int>
#>  1       17     1
#>  2       80    49
#>  3       94   976
#>  4       96   607
#>  5      116   443
#>  6      143   439
#>  7      160   376
#>  8      169   545
#>  9      173   221
#> 10      184  5504
#> # … with 204 more rows

What I want to do is to bin the distance column by bin of size 100, and also summing the n column accordingly. How can do that?

So you get something like:

bin_distance sum_n
1-100       1633  #(1 + 49 + 976 + 607)
101-200     21344 # (443 + ... + 5327)
#etc
like image 850
scamander Avatar asked Dec 18 '22 19:12

scamander


1 Answers

The most simple approach would be to use cut by creating groups using seq for every 100 values and sum the values for each group.

library(dplyr)

df %>%
  group_by(group = cut(distance, breaks = seq(0, max(distance), 100))) %>%
  summarise(n = sum(n))


#   group         n
#   <fct>       <int>
# 1 (0,100]      1633
# 2 (100,200]   21344
# 3 (200,300]   28310
# 4 (300,400]    7748
# 5 (400,500]   21292
# 6 (500,600]   26815
# 7 (600,700]    7846
# 8 (700,800]   48904
# 9 (800,900]    7574
#10 (900,1e+03] 18205
# ... with 17 more rows

which can be translated to base R using aggregate like

aggregate(n ~ distance, 
 transform(df, distance = cut(distance, breaks = seq(0, max(distance), 100))), sum)
like image 80
Ronak Shah Avatar answered Jan 07 '23 04:01

Ronak Shah