How to summarize data by group, creating dummy variables as the collapsing method

I'm trying to summarize a dataset by group, creating dummy columns that indicate whether each group's values appear among the data's overall most frequent values.

As an example, let's take flights data from nycflights13.

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_raw
#> # A tibble: 336,776 x 3
#>    carrier month dest 
#>    <chr>   <int> <chr>
#>  1 UA          1 IAH  
#>  2 UA          1 IAH  
#>  3 AA          1 MIA  
#>  4 B6          1 BQN  
#>  5 DL          1 ATL  
#>  6 UA          1 ORD  
#>  7 B6          1 FLL  
#>  8 EV          1 IAD  
#>  9 B6          1 MCO  
#> 10 AA          1 ORD  
#> # ... with 336,766 more rows

My end-goal: for each carrier in each month, I want to know whether it flew to the most popular destinations. I define "most popular" as the top-5 most frequent dest values in each month, then pooling all months' top-5s and keeping the unique values.

step 1
I start with a simple aggregation by month:

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_agg
#> # A tibble: 1,113 x 3
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     1 LAX    1159
#>  7     1 CLT    1058
#>  8     1 MIA     981
#>  9     1 SFO     889
#> 10     1 DCA     865
#> # ... with 1,103 more rows

step 2
Now I cut the data to keep only the 5 most popular destinations per month.

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups:   month [12]
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     2 ATL    1267
#>  7     2 ORD    1197
#>  8     2 BOS    1182
#>  9     2 MCO    1110
#> 10     2 FLL    1073
#> # ... with 50 more rows

step 3
Now simply get the unique() of my_flights_top_5_by_month$dest:

my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)

## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"

Here's my question: given my_flights_top_dest_across_months, how can I collapse my_flights_raw to distinct carrier & month combinations, with one dummy column per value of my_flights_top_dest_across_months indicating whether that carrier & month combination had flown to that destination?

desired output

##    carrier month ATL   ORD   BOS   MCO   FLL   LAX   SFO   CLT  
##    <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 9E          1 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  2 9E          2 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  3 9E          3 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  4 9E          4 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  5 9E          5 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6 9E          6 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  7 9E          7 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  8 9E          8 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  9 9E          9 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## 10 9E         10 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## # ... with 175 more rows

I currently have the following code, which is simply inefficient. It works fine for the example flights data, but takes forever when applied to a large dataset (several million rows and groups). Any idea how the task described above can be done more efficiently?

# too slow :(
op_slow_output <- 
  my_flights_raw %>%
  group_by(carrier, month) %>%
  summarise(destinations_vec = list(unique(dest))) %>%
  tibble::add_column(top_dest = list(my_flights_top_dest_across_months)) %>%
  mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x), .keep = "unused") %>%
  mutate(across(are_top_dest_included, ~ purrr::map(.x = ., .f = ~ setNames(object = .x, nm = my_flights_top_dest_across_months)))) %>%
  tidyr::unnest_wider(are_top_dest_included)
Asked by Emman, Nov 04 '21


4 Answers

It is quite possible that the data.table library would be faster here; I won't argue with that. But I know dplyr well and would like to offer a pretty neat solution using functions from that library.

First, let's prepare two small auxiliary functions. We will see how they work later.

library(nycflights13)
library(tidyverse)


ftopDest = function(data, ntop){
  data %>% 
    group_by(dest) %>% 
    summarise(ndest = n()) %>% 
    arrange(desc(ndest)) %>% 
    pull(dest) %>% .[1:ntop]
}

carrierToTopDest = function(data, topDest){
  data %>% mutate(carrierToToDest = dest %in% topDest)
}
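
To get a feel for ftopDest(), here is a quick illustrative call on the January flights; the result matches the January top-5 shown later in this answer:

ftopDest(filter(flights, month == 1), 5)
# [1] "ATL" "ORD" "BOS" "MCO" "FLL"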

Now you only need one simple mutation!

df = flights %>% nest_by(year, month) %>%  #Step 1
  mutate(topDest = list(ftopDest(data, 5)),  #Step 2
         data = list(carrierToTopDest(data, topDest)))  #Step 3
  

But let me describe step by step what is happening here.

In step 1, we nest the data into a list column of tibbles named data.

Output after Step 1

# A tibble: 12 x 3
# Rowwise:  year, month
    year month                data
   <int> <int> <list<tibble[,17]>>
 1  2013     1       [27,004 x 17]
 2  2013     2       [24,951 x 17]
 3  2013     3       [28,834 x 17]
 4  2013     4       [28,330 x 17]
 5  2013     5       [28,796 x 17]
 6  2013     6       [28,243 x 17]
 7  2013     7       [29,425 x 17]
 8  2013     8       [29,327 x 17]
 9  2013     9       [27,574 x 17]
10  2013    10       [28,889 x 17]
11  2013    11       [27,268 x 17]
12  2013    12       [28,135 x 17]

In step 2, we add the most popular flight destinations.

Output after step 2

# A tibble: 12 x 4
# Rowwise:  year, month
    year month                data topDest  
   <int> <int> <list<tibble[,17]>> <list>   
 1  2013     1       [27,004 x 17] <chr [5]>
 2  2013     2       [24,951 x 17] <chr [5]>
 3  2013     3       [28,834 x 17] <chr [5]>
 4  2013     4       [28,330 x 17] <chr [5]>
 5  2013     5       [28,796 x 17] <chr [5]>
 6  2013     6       [28,243 x 17] <chr [5]>
 7  2013     7       [29,425 x 17] <chr [5]>
 8  2013     8       [29,327 x 17] <chr [5]>
 9  2013     9       [27,574 x 17] <chr [5]>
10  2013    10       [28,889 x 17] <chr [5]>
11  2013    11       [27,268 x 17] <chr [5]>
12  2013    12       [28,135 x 17] <chr [5]>

In the last step, we add the carrierToToDest variable inside data; it indicates whether a flight went to one of the ntop most popular destinations of the given month.

Output after step 3

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest  
   <int> <int> <list>                 <list>   
 1  2013     1 <tibble [27,004 x 18]> <chr [5]>
 2  2013     2 <tibble [24,951 x 18]> <chr [5]>
 3  2013     3 <tibble [28,834 x 18]> <chr [5]>
 4  2013     4 <tibble [28,330 x 18]> <chr [5]>
 5  2013     5 <tibble [28,796 x 18]> <chr [5]>
 6  2013     6 <tibble [28,243 x 18]> <chr [5]>
 7  2013     7 <tibble [29,425 x 18]> <chr [5]>
 8  2013     8 <tibble [29,327 x 18]> <chr [5]>
 9  2013     9 <tibble [27,574 x 18]> <chr [5]>
10  2013    10 <tibble [28,889 x 18]> <chr [5]>
11  2013    11 <tibble [27,268 x 18]> <chr [5]>
12  2013    12 <tibble [28,135 x 18]> <chr [5]>

Now we can see the most popular destinations. Let's do this:

df %>% mutate(topDest = paste(topDest, collapse = " "))

output

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest            
   <int> <int> <list>                 <chr>              
 1  2013     1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
 2  2013     2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
 3  2013     3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
 4  2013     4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
 5  2013     5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
 6  2013     6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
 7  2013     7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
 8  2013     8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
 9  2013     9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10  2013    10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11  2013    11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12  2013    12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT

Can we see flights to these destinations? Of course, it's not difficult.

df %>% select(-topDest) %>% 
  unnest(data) %>% 
  filter(carrierToToDest) %>% 
  select(year, month, flight, carrier, dest) 

Output

# A tibble: 80,941 x 5
# Groups:   year, month [12]
    year month flight carrier dest 
   <int> <int>  <int> <chr>   <chr>
 1  2013     1    461 DL      ATL  
 2  2013     1   1696 UA      ORD  
 3  2013     1    507 B6      FLL  
 4  2013     1     79 B6      MCO  
 5  2013     1    301 AA      ORD  
 6  2013     1   1806 B6      BOS  
 7  2013     1    371 B6      FLL  
 8  2013     1   4650 MQ      ATL  
 9  2013     1   1743 DL      ATL  
10  2013     1   3768 MQ      ORD  
# ... with 80,931 more rows

This is my recipe: very simple and transparent, in my opinion. I would be much obliged if you tried it on your data and let me know how it performs.

Small update

I just noticed that the grouping should be done not only by year (you don't mention it, but it has to be there) and month, but also by the carrier variable. So let's add it as another grouping variable.

df = flights %>% nest_by(year, month, carrier) %>%  
  mutate(topDest = list(ftopDest(data, 5)),  
         data = list(carrierToTopDest(data, topDest)))  

output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest  
   <int> <int> <chr>   <list>                <list>   
 1  2013     1 9E      <tibble [1,573 x 17]> <chr [5]>
 2  2013     1 AA      <tibble [2,794 x 17]> <chr [5]>
 3  2013     1 AS      <tibble [62 x 17]>    <chr [5]>
 4  2013     1 B6      <tibble [4,427 x 17]> <chr [5]>
 5  2013     1 DL      <tibble [3,690 x 17]> <chr [5]>
 6  2013     1 EV      <tibble [4,171 x 17]> <chr [5]>
 7  2013     1 F9      <tibble [59 x 17]>    <chr [5]>
 8  2013     1 FL      <tibble [328 x 17]>   <chr [5]>
 9  2013     1 HA      <tibble [31 x 17]>    <chr [5]>
10  2013     1 MQ      <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows

Now let's look at the new top-5 destinations.

df %>% mutate(topDest = paste(topDest, collapse = " "))

output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest            
   <int> <int> <chr>   <list>                <chr>              
 1  2013     1 9E      <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
 2  2013     1 AA      <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
 3  2013     1 AS      <tibble [62 x 17]>    SEA NA NA NA NA    
 4  2013     1 B6      <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
 5  2013     1 DL      <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
 6  2013     1 EV      <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
 7  2013     1 F9      <tibble [59 x 17]>    DEN NA NA NA NA    
 8  2013     1 FL      <tibble [328 x 17]>   ATL CAK MKE NA NA  
 9  2013     1 HA      <tibble [31 x 17]>    HNL NA NA NA NA    
10  2013     1 MQ      <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
# ... with 175 more rows

Summing up, this form is very clear to me. I can see the most popular destinations with df %>% mutate(topDest = paste(topDest, collapse = " ")), I can filter all flights to the most popular destinations with df %>% select(-topDest) %>% unnest(data) %>% filter(carrierToToDest) %>% select(year, month, flight, carrier, dest), and I can do any other transformations. I do not think that spreading the same information wide across 100+ variables is convenient for any analysis.

However, if you really need the wider form, let me know and we'll do it that way.

A big update for anyone interested

Results not as expected!

Dear colleagues, in the excitement of finding the most efficient solutions, you have gone down the wrong path and missed the fact that you are getting the wrong data!

@Emman posted a clear assignment: "I'm interested to know about each carrier in each month: whether it flew to the most popular destinations. I define 'most popular' by the top-5 most frequent dest values in each month, then pooling all months' top-5s."

Solving it my way, I get the following most popular destinations for the individual months:

df %>% mutate(topDest = paste(topDest, collapse = " ")) %>% 
  select(topDest)

output

# A tibble: 12 x 3
# Rowwise:  year, month
    year month topDest            
   <int> <int> <chr>              
 1  2013     1 ATL ORD BOS MCO FLL
 2  2013     2 ATL ORD BOS MCO FLL
 3  2013     3 ATL ORD BOS MCO FLL
 4  2013     4 ATL ORD LAX BOS MCO
 5  2013     5 ORD ATL LAX BOS SFO
 6  2013     6 ORD ATL LAX BOS SFO
 7  2013     7 ORD ATL LAX BOS CLT
 8  2013     8 ORD ATL LAX BOS SFO
 9  2013     9 ORD LAX ATL BOS CLT
10  2013    10 ORD ATL LAX BOS CLT
11  2013    11 ATL ORD LAX BOS CLT
12  2013    12 ATL LAX MCO ORD CLT

Let's check whether I made a mistake somewhere, with a test on three sample months.

flights %>%
  filter(month==1) %>% 
  group_by(dest) %>%
  summarise(ndest = n()) %>%
  arrange(desc(ndest)) %>%
  pull(dest) %>% .[1:5]
#[1] "ATL" "ORD" "BOS" "MCO" "FLL"

flights %>%
  filter(month==6) %>% 
  group_by(dest) %>%
  summarise(ndest = n()) %>%
  arrange(desc(ndest)) %>%
  pull(dest) %>% .[1:5]
#[1] "ORD" "ATL" "LAX" "BOS" "SFO"

flights %>%
  filter(month==10) %>% 
  group_by(dest) %>%
  summarise(ndest = n()) %>%
  arrange(desc(ndest)) %>%
  pull(dest) %>% .[1:5]
#[1] "ORD" "ATL" "LAX" "BOS" "CLT"

Well, it's hard to deny that my results agree with those of this check.

It is also very clear that in neither January nor February was CLT among the 5 most popular destinations!

However, if we compare this with the expected result given by @Emman, I have to conclude that this expectation is inconsistent with the initial assumption!

##    carrier month ATL   ORD   BOS   MCO   FLL   LAX   SFO   CLT  
##    <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 9E          1 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  2 9E          2 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  3 9E          3 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  4 9E          4 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  5 9E          5 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6 9E          6 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  7 9E          7 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  8 9E          8 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  9 9E          9 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## 10 9E         10 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## # ... with 175 more rows

From the above data taken from the author of the question, one would conclude that CLT was among the five most popular destinations from January through October, whereas in fact only July, September and October qualify.

Defending my own solution

Although I haven't run any performance tests yet, I would like to point out that even the fastest solution is of no use if it returns incorrect results.

Now a little defense of my own solution. I know, I know, it sounds immodest.

First, I got everything I needed in three simple and clear steps, with one uncomplicated mutation.

Second, the whole process needed no intermediate tables.

Third, I kept the original form of the data, supplementing it only with the carrierToToDest variable, which flags a flight to one of the top-5 destinations; this greatly facilitates subsequent filtering and any further work on the data.

So let me recap what needs to be done and reassemble all the code we need below.

library(nycflights13)
library(tidyverse)


ftopDest = function(data, ntop){
  data %>%
    group_by(dest) %>%
    summarise(ndest = n()) %>%
    arrange(desc(ndest)) %>%
    pull(dest) %>% .[1:ntop]
}

carrierToTopDest = function(data, topDest){
  data %>% mutate(carrierToToDest = dest %in% topDest)
}

df = flights %>% nest_by(year, month) %>%  #Step 1
  mutate(topDest = list(ftopDest(data, 5)),  #Step 2
         data = list(carrierToTopDest(data, topDest)))  #Step 3

I will also remind you how to retrieve the most popular destinations for the individual months.

df %>% mutate(topDest = paste(topDest, collapse = " ")) %>% 
  select(topDest)

output

# A tibble: 12 x 3
# Rowwise:  year, month
    year month topDest            
   <int> <int> <chr>              
 1  2013     1 ATL ORD BOS MCO FLL
 2  2013     2 ATL ORD BOS MCO FLL
 3  2013     3 ATL ORD BOS MCO FLL
 4  2013     4 ATL ORD LAX BOS MCO
 5  2013     5 ORD ATL LAX BOS SFO
 6  2013     6 ORD ATL LAX BOS SFO
 7  2013     7 ORD ATL LAX BOS CLT
 8  2013     8 ORD ATL LAX BOS SFO
 9  2013     9 ORD LAX ATL BOS CLT
10  2013    10 ORD ATL LAX BOS CLT
11  2013    11 ATL ORD LAX BOS CLT
12  2013    12 ATL LAX MCO ORD CLT

In turn, the data in its original form (along with the new variable carrierToToDest) can be recovered this way:

df %>% select(-topDest) %>% unnest(data)

output

# A tibble: 336,776 x 20
# Groups:   year, month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr> 
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR   
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA   
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK   
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK   
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA   
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR   
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR   
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA   
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK   
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA   
# ... with 336,766 more rows, and 7 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, carrierToToDest <lgl>

Data as expected by @Emman

However, if I want to present this data in a form similar to the one expected by @Emman, I can always do it like this.

df %>% select(-topDest) %>%
  unnest(data) %>%
  filter(carrierToToDest) %>%
  group_by(carrier, month, dest) %>% 
  summarise(v= T, .groups="drop") %>% 
  pivot_wider(names_from = dest, values_from =  v)

output

# A tibble: 125 x 10
   carrier month ATL   BOS   ORD   CLT   FLL   MCO   LAX   SFO  
   <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 9E          1 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 2 9E          2 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 3 9E          3 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 4 9E          4 NA    TRUE  TRUE  NA    NA    NA    NA    NA   
 5 9E          5 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 6 9E          6 NA    TRUE  TRUE  NA    NA    NA    NA    NA   
 7 9E          7 NA    TRUE  TRUE  TRUE  NA    NA    NA    NA   
 8 9E          8 NA    TRUE  TRUE  NA    NA    NA    NA    NA   
 9 9E          9 NA    TRUE  TRUE  TRUE  NA    NA    NA    NA   
10 9E         10 NA    TRUE  TRUE  TRUE  NA    NA    NA    NA   
# ... with 115 more rows

The main difference is that the data are correct with respect to the assumptions, though they contain NA where FALSE is expected.

Of course, nothing prevents you from adding mutate_if(is.logical, ~ifelse(is.na(.x), FALSE, .x)) at the end, which will replace every occurrence of NA with FALSE.
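
Putting it together, a minimal sketch of the wide table with FALSE instead of NA (the same pipeline as above with the replacement appended; passing values_fill = FALSE to pivot_wider() would achieve the same thing):

df %>% select(-topDest) %>%
  unnest(data) %>%
  filter(carrierToToDest) %>%
  group_by(carrier, month, dest) %>%
  summarise(v = TRUE, .groups = "drop") %>%
  pivot_wider(names_from = dest, values_from = v) %>%
  mutate_if(is.logical, ~ifelse(is.na(.x), FALSE, .x))  # replace NA with FALSE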

Additional stats

Organizing the data in the form I propose also lets you easily extract additional statistics and other useful information. For example, if you are interested in which carrier flies most often to the most popular destinations, you can do this:

df %>% select(-topDest) %>%
  unnest(data) %>% 
  group_by(carrier, carrierToToDest) %>% 
  summarise(n = n(), .groups="drop") %>% 
  pivot_wider(names_from = carrierToToDest, values_from = n) %>% 
  mutate(prop = `TRUE`/`FALSE`)%>% 
  arrange(desc(prop))

output

# A tibble: 16 x 4
   carrier `FALSE` `TRUE`     prop
   <chr>     <int>  <int>    <dbl>
 1 FL          923   2337  2.53   
 2 VX         2387   2775  1.16   
 3 US        12866   7670  0.596  
 4 DL        31978  16132  0.504  
 5 AA        21793  10936  0.502  
 6 UA        39719  18946  0.477  
 7 YV          434    167  0.385  
 8 B6        43170  11465  0.266  
 9 MQ        21146   5251  0.248  
10 9E        16464   1996  0.121  
11 EV        50967   3206  0.0629 
12 OO           31      1  0.0323 
13 WN        12216     59  0.00483
14 AS          714     NA NA      
15 F9          685     NA NA      
16 HA          342     NA NA  

As you can see, on an annual basis FL had the highest ratio of flights to the most popular destinations. On the other hand, AS, F9 and HA never made such flights.
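
As a side note, the NA values in the `TRUE` column (and hence in prop) appear simply because those carriers have no rows with carrierToToDest == TRUE. If you prefer zeros there, a small sketch is to let pivot_wider() fill the missing cells:

df %>% select(-topDest) %>%
  unnest(data) %>% 
  group_by(carrier, carrierToToDest) %>% 
  summarise(n = n(), .groups = "drop") %>% 
  pivot_wider(names_from = carrierToToDest, values_from = n, values_fill = 0) %>%  # missing counts become 0
  mutate(prop = `TRUE`/`FALSE`) %>% 
  arrange(desc(prop))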

But maybe you are interested in it on a monthly basis. Nothing simpler. Just do this:

df %>% select(-topDest) %>%
  unnest(data) %>% 
  group_by(month, carrier, carrierToToDest) %>% 
  summarise(n = n(), .groups="drop") %>% 
  pivot_wider(names_from = carrierToToDest, values_from = n) %>% 
  mutate(prop = `TRUE`/`FALSE`) %>% 
  arrange(desc(prop))

output

# A tibble: 185 x 5
   month carrier `FALSE` `TRUE`  prop
   <int> <chr>     <int>  <int> <dbl>
 1     5 VX           31    465 15   
 2     6 VX           30    450 15   
 3     8 VX           31    458 14.8 
 4     9 YV            9     33  3.67
 5    10 FL           58    178  3.07
 6     5 FL           85    240  2.82
 7     4 FL           82    229  2.79
 8     3 FL           85    231  2.72
 9     2 FL           80    216  2.7 
10     1 FL           89    239  2.69
# ... with 175 more rows

As you can see, the winner here is VX, which in May, June and August flew about 15 times more often to the top-5 destinations than to all other places.

Performance tests

Forgive me for not having run a performance test yet; maybe soon. However, for anyone who would like to make comparisons, please take two important facts into account. First, I keep the data frame in its original form. Second, I determine the most popular destinations inside my calculation. Please account for this in any performance tests.
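
For anyone who does want to time it, a minimal sketch using the bench package (the same package used in the benchmarks elsewhere in this thread, assuming it is installed); note that the whole computation, including finding the top destinations, is wrapped inside the timed function, as requested above:

marek_nest_by <- function() {
  flights %>%
    nest_by(year, month) %>%
    mutate(topDest = list(ftopDest(data, 5)),
           data = list(carrierToTopDest(data, topDest)))
}

bench::mark(
  marek_nest_by = marek_nest_by(),
  check = FALSE,
  iterations = 10
)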

Final apology

Of course, I may be wrong somewhere. Maybe I misread the author of the question? English is not my mother tongue, so I could have made a mistake in reading the assumptions. However, I don't see where the bug is or why our results differ.

Answered by Marek Fiołka, Oct 24 '22


Update

I improved my data.table solution with the following:

thomas_data.table2 <- function() {
  library(data.table)
  dcast(
    data.table(dest = my_flights_top_dest_across_months)[
      unique(setDT(my_flights_raw)),
      on = .(dest)
    ],
    carrier + month ~ dest
  )[
    ,
    .(carrier, month, .SD[, my_flights_top_dest_across_months, with = FALSE] > 0)
  ]
}

and the benchmark looks like this:

(benchmark plot from ggplot2::autoplot(bm); image not reproduced here)

Here is the benchmarking script:

library(nycflights13)
library(dplyr, warn.conflicts = FALSE)

# OP original
my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_agg <-
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs))

my_flights_top_dest_across_months <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5) %>%
  pull(dest) %>%
  unique()

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)


op_slow <- function() {
  library(tidyr)
  library(tibble)
  library(purrr)

  my_flights_raw %>%
    group_by(carrier, month) %>%
    summarise(destinations_vec = list(unique(dest))) %>%
    add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
    mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x), .keep = "unused") %>%
    mutate(across(are_top_dest_included, ~ purrr::map(.x = ., .f = ~ setNames(object = .x, nm = my_flights_top_dest_across_month)))) %>%
    tidyr::unnest_wider(are_top_dest_included)
}


# OP collapse
op_collapse <- function() {
  library(magrittr)
  library(collapse)
  library(data.table)

  my_flights_raw %>%
    collapse::funique() %>%
    collapse::fgroup_by(carrier, month) %>%
    collapse::fsummarise(nested_dest = list(dest)) %>%
    collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
    collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
    setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
    collapse::qTBL()
}


# Thomas data.table

thomas_data.table1 <- function() {
  library(data.table)

  my_flights_top_dest_across_months <-
    data.table(
      dest = unique(my_flights_top_5_by_month$dest),
      fd = 1
    )

  dcast(my_flights_top_dest_across_months[
    setDT(my_flights_raw),
    on = .(dest)
  ],
  carrier + month ~ dest,
  fun.aggregate = function(x) sum(x) > 0
  )[, c(
    "carrier", "month",
    my_flights_top_dest_across_months$dest
  ), with = FALSE]
}

thomas_data.table2 <- function() {
  library(data.table)
  dcast(
    data.table(dest = my_flights_top_dest_across_months)[
      unique(setDT(my_flights_raw)),
      on = .(dest)
    ],
    carrier + month ~ dest
  )[
    ,
    .(carrier, month, .SD[, my_flights_top_dest_across_months, with = FALSE] > 0)
  ]
}

# output_op_slow <- op_slow()
# output_op_collapse <- op_collapse()
# output_thomas1 <- thomas_data.table1()
# output_thomas2 <- thomas_data.table2()
# #> Using 'month' as value column. Use 'value.var' to override

# waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
# #> v No differences
# waldo::compare(output_op_slow, as_tibble(output_thomas1), ignore_attr = TRUE)
# #> v No differences

bm <- bench::mark(
  op_slow = op_slow(),
  op_collapse = op_collapse(),
  thomas_dt1 = thomas_data.table1(),
  thomas_dt2 = thomas_data.table2(),
  check = FALSE,
  iterations = 100L
)

ggplot2::autoplot(bm)

Previous Answer

Given my_flights_top_5_by_month and my_flights_raw, we can try the following data.table approach

library(data.table)

my_flights_top_dest_across_months <- data.table(
  dest = unique(my_flights_top_5_by_month$dest),
  fd = 1
)
dcast(my_flights_top_dest_across_months[
  setDT(my_flights_raw),
  on = .(dest)
],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0
)[, c(
  "carrier", "month",
  my_flights_top_dest_across_months$dest
), with = FALSE]

which gives

     carrier month   ATL   ORD   BOS   MCO   FLL   LAX   SFO  CLT
  1:      9E     1  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  2:      9E     2  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  3:      9E     3  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  4:      9E     4 FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  5:      9E     5  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
 ---
181:      YV     8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
182:      YV     9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
183:      YV    10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
184:      YV    11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
185:      YV    12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Answered by ThomasIsCoding, Oct 24 '22


Does this do what you want? As far as I can tell it matches your output, but it has more rows because it includes all months for all carriers; carrier "OO" only has flights in 5 months, and your version only shows those 5 months in the summary.

With the data as provided (336k rows), this takes about the same time as your function, but it becomes faster as the data grow. When I run both on data 100x as big, after setting my_flights_raw <- my_flights_raw %>% tidyr::uncount(100) to make it ~33M rows, the code below is about 40% faster.

Given the large number of groups you're dealing with, I expect this is a situation where data.table will really shine with better performance.

library(tidyverse)
my_flights_raw %>%
  count(carrier, month, dest) %>%
  complete(carrier, month, dest) %>%
  filter(dest %in% my_flights_top_dest_across_months) %>%
  mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>%
  pivot_wider(names_from = dest, values_from = n) 
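
If you want to reproduce the scaling comparison described above, a rough sketch (tidyr::uncount() replicates each row, here 100x, giving roughly 33M rows; timings will of course vary by machine):

my_flights_big <- my_flights_raw %>% tidyr::uncount(100)  # ~33.7M rows

system.time(
  my_flights_big %>%
    count(carrier, month, dest) %>%
    complete(carrier, month, dest) %>%
    filter(dest %in% my_flights_top_dest_across_months) %>%
    mutate(n = !is.na(n)) %>%   # equivalent to the if_else() above
    pivot_wider(names_from = dest, values_from = n)
)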
Answered by Jon Spring, Oct 24 '22


I took a stab at it myself, using functions from the collapse package.

library(magrittr)
library(collapse)
library(data.table)
  
my_flights_raw %>%
  collapse::funique() %>%
  collapse::fgroup_by(carrier, month) %>%
  collapse::fsummarise(nested_dest = list(dest)) %>%
  collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
  collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
  setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
  collapse::qTBL()

Unsurprisingly, collapse gives the fastest execution time. But I was surprised to see that @ThomasIsCoding's solution based on data.table was slower than my original tidyverse mix-and-match solution.

I also factored in that Thomas's answer has a single data.table dependency, compared to the variety of dependencies in my original method.

library(nycflights13)
library(dplyr, warn.conflicts = FALSE)

# OP original
my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_top_dest_across_months <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5) %>%
  pull(dest) %>%
  unique()

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)


op_slow <- function() {
  library(tidyr)
  library(tibble)
  library(purrr)
  
  my_flights_raw %>%
    group_by(carrier, month) %>%
    summarise(destinations_vec = list(unique(dest))) %>%
    add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
    mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
    mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month))  )) %>%
    tidyr::unnest_wider(are_top_dest_included)
}  


# OP collapse
op_collapse <- function() {
  library(magrittr)
  library(collapse)
  library(data.table)
  
  my_flights_raw %>%
    collapse::funique() %>%
    collapse::fgroup_by(carrier, month) %>%
    collapse::fsummarise(nested_dest = list(dest)) %>%
    collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
    collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
    setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
    collapse::qTBL()
}
  

# Thomas data.table
thomas_data.table <- function() {
  library(data.table)
  
  my_flights_top_dest_across_months <- 
    data.table(
      dest = unique(my_flights_top_5_by_month$dest),
      fd = 1
    )
  
  dcast(my_flights_top_dest_across_months[
    setDT(my_flights_raw),
    on = .(dest)
  ],
  carrier + month ~ dest,
  fun.aggregate = function(x) sum(x) > 0
  )[, c(
    "carrier", "month",
    my_flights_top_dest_across_months$dest
  ), with = FALSE]
}

output_op_slow <- op_slow()
output_op_collapse <- op_collapse()
output_thomas <- thomas_data.table()
#> Using 'month' as value column. Use 'value.var' to override

waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
#> v No differences
waldo::compare(output_op_slow, as_tibble(output_thomas), ignore_attr = TRUE) 
#> v No differences

bm <- bench::mark(op_slow = op_slow(),
            op_collapse = op_collapse(),
            thomas_dt = thomas_data.table(),
            check = FALSE,
            iterations = 100)

ggplot2::autoplot(bm)

Answered by Emman, Oct 24 '22