I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values. As an example, let's take <code>flights</code> data from <code>nycflights13</code>. <pre class="prettyprint lang-r prettyprint-override"><code>library(dplyr, warn.conflicts = FALSE) library(nycflights13) my_flights_raw <- flights %>% select(carrier, month, dest) my_flights_raw #> # A tibble: 336,776 x 3 #> carrier month dest #> <chr> <int> <chr> #> 1 UA 1 IAH #> 2 UA 1 IAH #> 3 AA 1 MIA #> 4 B6 1 BQN #> 5 DL 1 ATL #> 6 UA 1 ORD #> 7 B6 1 FLL #> 8 EV 1 IAD #> 9 B6 1 MCO #> 10 AA 1 ORD #> # ... with 336,766 more rows </code></pre> My end-goal: I'm interested to know about each <code>carrier</code> in each <code>month</code>: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent <code>dest</code> values in each month, then intersecting all months' top-5s. step 1 I start by simple aggregation by months: <pre class="prettyprint lang-r prettyprint-override"><code>my_flights_agg <- my_flights_raw %>% count(month, dest, name = "n_obs") %>% arrange(month, desc(n_obs)) my_flights_agg #> # A tibble: 1,113 x 3 #> month dest n_obs #> <int> <chr> <int> #> 1 1 ATL 1396 #> 2 1 ORD 1269 #> 3 1 BOS 1245 #> 4 1 MCO 1175 #> 5 1 FLL 1161 #> 6 1 LAX 1159 #> 7 1 CLT 1058 #> 8 1 MIA 981 #> 9 1 SFO 889 #> 10 1 DCA 865 #> # ... with 1,103 more rows </code></pre> step 2 And now I'm going to cut the data to keep only the top 5 most popular per month. <pre class="prettyprint lang-r prettyprint-override"><code>my_flights_top_5_by_month <- my_flights_agg %>% group_by(month) %>% slice_max(order_by = n_obs, n = 5) my_flights_top_5_by_month #> # A tibble: 60 x 3 #> # Groups: month [12] #> month dest n_obs #> <int> <chr> <int> #> 1 1 ATL 1396 #> 2 1 ORD 1269 #> 3 1 BOS 1245 #> 4 1 MCO 1175 #> 5 1 FLL 1161 #> 6 2 ATL 1267 #> 7 2 ORD 1197 #> 8 2 BOS 1182 #> 9 2 MCO 1110 #> 10 2 FLL 1073 #> # ... with 50 more rows </code></pre> step 3 Now simply get the <code>unique()</code> of <code>my_flights_top_5_by_month$dest</code>: <pre class="prettyprint lang-r prettyprint-override"><code>my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest) ## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT" </code></pre> <hr> Here's my question: given <code>my_flights_top_dest_across_months</code>, how can I summarize <code>my_flights_raw</code> to distinct <code>carrier</code> & <code>month</code>, such that the collapsing principle is whether each combination of <code>carrier</code> & <code>month</code> had flawn to each of the <code>dest</code> values in <code>my_flights_top_dest_across_months</code>? desired output <pre class="prettyprint lang-r prettyprint-override"><code>## carrier month ATL ORD BOS MCO FLL LAX SFO CLT ## <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> ## 1 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 2 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 3 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 4 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 5 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 6 9E 6 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 7 9E 7 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 8 9E 8 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 9 9E 9 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 10 9E 10 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## # ... with 175 more rows </code></pre> <hr> I currently have the following code that is simply inefficient. It works fine for the example <code>flights</code> data, but is taking forever when applied on a large dataset (with several millions rows and groups). Any idea how the task described above can be done more efficiently? <pre class="prettyprint lang-r prettyprint-override"><code># too slow :( op_slow_output <- my_flights_raw %>% group_by(carrier, month) %>% summarise(destinations_vec = list(unique(dest))) %>% add_column(top_dest = list(my_flights_top_dest_across_month)) %>% mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>% mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>% tidyr::unnest_wider(are_top_dest_included) </code></pre>

It is quite possible that using the <code>data.table</code> library will be faster here. I will not argue. But I have mastered <code>dplyr</code> and would like to offer a pretty cool solution using the functions from this particular library. First, let's prepare two little auxiliary functions. We will see how they work later. <pre class="prettyprint"><code>library(nycflights13) library(tidyverse) ftopDest = function(data, ntop){ data %>% group_by(dest) %>% summarise(ndest = n()) %>% arrange(desc(ndest)) %>% pull(dest) %>% .[1:ntop] } carrierToTopDest = function(data, topDest){ data %>% mutate(carrierToToDest = dest %in% topDest) } </code></pre> Now you only need one simple mutation! <pre class="prettyprint"><code>df = flights %>% nest_by(year, month) %>% #Step 1 mutate(topDest = list(ftopDest(data, 5)), #Step 2 data = list(carrierToTopDest(data, topDest))) #Step 3 </code></pre> But let me describe step by step what is happening here. In step one, let's nest the data into an internal <code>tibble</code> named <code>data</code>. Output after Step 1 <pre class="prettyprint"><code># A tibble: 12 x 3 # Rowwise: year, month year month data <int> <int> <list<tibble[,17]>> 1 2013 1 [27,004 x 17] 2 2013 2 [24,951 x 17] 3 2013 3 [28,834 x 17] 4 2013 4 [28,330 x 17] 5 2013 5 [28,796 x 17] 6 2013 6 [28,243 x 17] 7 2013 7 [29,425 x 17] 8 2013 8 [29,327 x 17] 9 2013 9 [27,574 x 17] 10 2013 10 [28,889 x 17] 11 2013 11 [27,268 x 17] 12 2013 12 [28,135 x 17] </code></pre> In step 2, we add the most popular flight destinations. Output after step 2 <pre class="prettyprint"><code># A tibble: 12 x 4 # Rowwise: year, month year month data topDest <int> <int> <list<tibble[,17]>> <list> 1 2013 1 [27,004 x 17] <chr [5]> 2 2013 2 [24,951 x 17] <chr [5]> 3 2013 3 [28,834 x 17] <chr [5]> 4 2013 4 [28,330 x 17] <chr [5]> 5 2013 5 [28,796 x 17] <chr [5]> 6 2013 6 [28,243 x 17] <chr [5]> 7 2013 7 [29,425 x 17] <chr [5]> 8 2013 8 [29,327 x 17] <chr [5]> 9 2013 9 [27,574 x 17] <chr [5]> 10 2013 10 [28,889 x 17] <chr [5]> 11 2013 11 [27,268 x 17] <chr [5]> 12 2013 12 [28,135 x 17] <chr [5]> </code></pre> In the last step, we add the <code>carrierToToDest</code> variable to the <code>data</code> variable, which determines whether the flight was going to one of the <code>ntop</code> places from the given month. Output after step 3 <pre class="prettyprint"><code># A tibble: 12 x 4 # Rowwise: year, month year month data topDest <int> <int> <list> <list> 1 2013 1 <tibble [27,004 x 18]> <chr [5]> 2 2013 2 <tibble [24,951 x 18]> <chr [5]> 3 2013 3 <tibble [28,834 x 18]> <chr [5]> 4 2013 4 <tibble [28,330 x 18]> <chr [5]> 5 2013 5 <tibble [28,796 x 18]> <chr [5]> 6 2013 6 <tibble [28,243 x 18]> <chr [5]> 7 2013 7 <tibble [29,425 x 18]> <chr [5]> 8 2013 8 <tibble [29,327 x 18]> <chr [5]> 9 2013 9 <tibble [27,574 x 18]> <chr [5]> 10 2013 10 <tibble [28,889 x 18]> <chr [5]> 11 2013 11 <tibble [27,268 x 18]> <chr [5]> 12 2013 12 <tibble [28,135 x 18]> <chr [5]> </code></pre> How now we can see the most popular places. Let's do this: <pre class="prettyprint"><code>df %>% mutate(topDest = paste(topDest, collapse = " ")) </code></pre> output <pre class="prettyprint"><code># A tibble: 12 x 4 # Rowwise: year, month year month data topDest <int> <int> <list> <chr> 1 2013 1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL 2 2013 2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL 3 2013 3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL 4 2013 4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO 5 2013 5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO 6 2013 6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO 7 2013 7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT 8 2013 8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO 9 2013 9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT 10 2013 10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT 11 2013 11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT 12 2013 12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT </code></pre> Can we see flights to these destinations? Of course, it's not difficult. <pre class="prettyprint"><code>df %>% select(-topDest) %>% unnest(data) %>% filter(carrierToToDest) %>% select(year, month, flight, carrier, dest) </code></pre> Output <pre class="prettyprint"><code># A tibble: 80,941 x 5 # Groups: year, month [12] year month flight carrier dest <int> <int> <int> <chr> <chr> 1 2013 1 461 DL ATL 2 2013 1 1696 UA ORD 3 2013 1 507 B6 FLL 4 2013 1 79 B6 MCO 5 2013 1 301 AA ORD 6 2013 1 1806 B6 BOS 7 2013 1 371 B6 FLL 8 2013 1 4650 MQ ATL 9 2013 1 1743 DL ATL 10 2013 1 3768 MQ ORD # ... with 80,931 more rows </code></pre> This is my recipe. Very simple and transparent in my opinion. I would be extremely obligated if you would try it on your data and let me know with efficiency. <h3>Small update</h3> I just noticed that I wanted to group not only after <code>year</code> (although you don't mention it, it must be so), <code>month</code>, but also by the <code>carrier</code> variable. So let's add it as another grouping variable. <pre class="prettyprint"><code>df = flights %>% nest_by(year, month, carrier) %>% mutate(topDest = list(ftopDest(data, 5)), data = list(carrierToTopDest(data, topDest))) </code></pre> output <pre class="prettyprint"><code># A tibble: 185 x 5 # Rowwise: year, month, carrier year month carrier data topDest <int> <int> <chr> <list> <list> 1 2013 1 9E <tibble [1,573 x 17]> <chr [5]> 2 2013 1 AA <tibble [2,794 x 17]> <chr [5]> 3 2013 1 AS <tibble [62 x 17]> <chr [5]> 4 2013 1 B6 <tibble [4,427 x 17]> <chr [5]> 5 2013 1 DL <tibble [3,690 x 17]> <chr [5]> 6 2013 1 EV <tibble [4,171 x 17]> <chr [5]> 7 2013 1 F9 <tibble [59 x 17]> <chr [5]> 8 2013 1 FL <tibble [328 x 17]> <chr [5]> 9 2013 1 HA <tibble [31 x 17]> <chr [5]> 10 2013 1 MQ <tibble [2,271 x 17]> <chr [5]> # ... with 175 more rows </code></pre> Now let's get acquainted with the new top 5 directions. <pre class="prettyprint"><code>df %>% mutate(topDest = paste(topDest, collapse = " ")) </code></pre> output <pre class="prettyprint"><code># A tibble: 185 x 5 # Rowwise: year, month, carrier year month carrier data topDest <int> <int> <chr> <list> <chr> 1 2013 1 9E <tibble [1,573 x 17]> BOS PHL CVG MSP ORD 2 2013 1 AA <tibble [2,794 x 17]> DFW MIA ORD LAX BOS 3 2013 1 AS <tibble [62 x 17]> SEA NA NA NA NA 4 2013 1 B6 <tibble [4,427 x 17]> FLL MCO BOS PBI SJU 5 2013 1 DL <tibble [3,690 x 17]> ATL DTW MCO FLL MIA 6 2013 1 EV <tibble [4,171 x 17]> IAD DTW DCA RDU CVG 7 2013 1 F9 <tibble [59 x 17]> DEN NA NA NA NA 8 2013 1 FL <tibble [328 x 17]> ATL CAK MKE NA NA 9 2013 1 HA <tibble [31 x 17]> HNL NA NA NA NA 10 2013 1 MQ <tibble [2,271 x 17]> RDU CMH ORD BNA ATL # ... with 175 more rows </code></pre> Summing up, I would like to add that the form is very clear for me. I can see the most popular <code>df%>% mutate (topDest = paste (topDest, collapse =" "))</code> directions. I can filter all flights to the most popular destinations <code>df%>% select (-topDest)%>% unnest (data)%>% filter (carrierToToDest)%>% select (year, month, flight, carrier, dest)</code> and do any other transformations. I do not think that presenting the same information wider on over 100 variables is convenient for any analysis. However, if you really need wider form, let me know. We'll do it this way. <h3>A big update for anyone interested</h3> <h3>Results not as expected!</h3> Dear colleagues, you got caught up in some wrong path and you missed the fact that you are getting the wrong data when you are excited to find the most effective solutions! @Emman posted a clear assignment which was as follows I'm interested to know about each carrier in each month: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest values in each month, then intersecting all months' top-5s. Solving it in my way, I will get the following most popular destinations in individual months: <pre class="prettyprint"><code>df %>% mutate(topDest = paste(topDest, collapse = " ")) %>% select(topDest) </code></pre> output <pre class="prettyprint"><code># A tibble: 12 x 3 # Rowwise: year, month year month topDest <int> <int> <chr> 1 2013 1 ATL ORD BOS MCO FLL 2 2013 2 ATL ORD BOS MCO FLL 3 2013 3 ATL ORD BOS MCO FLL 4 2013 4 ATL ORD LAX BOS MCO 5 2013 5 ORD ATL LAX BOS SFO 6 2013 6 ORD ATL LAX BOS SFO 7 2013 7 ORD ATL LAX BOS CLT 8 2013 8 ORD ATL LAX BOS SFO 9 2013 9 ORD LAX ATL BOS CLT 10 2013 10 ORD ATL LAX BOS CLT 11 2013 11 ATL ORD LAX BOS CLT 12 2013 12 ATL LAX MCO ORD CLT </code></pre> Let's check if I made a mistake by accident. Let's do a test for three sample months. <pre class="prettyprint"><code>flights %>% filter(month==1) %>% group_by(dest) %>% summarise(ndest = n()) %>% arrange(desc(ndest)) %>% pull(dest) %>% .[1:5] #[1] "ATL" "ORD" "BOS" "MCO" "FLL" flights %>% filter(month==6) %>% group_by(dest) %>% summarise(ndest = n()) %>% arrange(desc(ndest)) %>% pull(dest) %>% .[1:5] #[1] "ORD" "ATL" "LAX" "BOS" "SFO" flights %>% filter(month==10) %>% group_by(dest) %>% summarise(ndest = n()) %>% arrange(desc(ndest)) %>% pull(dest) %>% .[1:5] #[1] "ORD" "ATL" "LAX" "BOS" "CLT" </code></pre> Well, it's probably hard to deny that my results dont differ from those of the conclusive test. It is also very clear that neither in January nor in February the direction <code>CLT</code> was not one of the 5 most popular destinations !! However, if we compare it with the result expected and given by @Emman, I have to conclude that this expectation is inconsistent with the initial assumption! <pre class="prettyprint"><code>## carrier month ATL ORD BOS MCO FLL LAX SFO CLT ## <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> ## 1 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 2 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 3 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 4 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 5 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 6 9E 6 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 7 9E 7 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 8 9E 8 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 9 9E 9 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## 10 9E 10 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## # ... with 175 more rows </code></pre> From the above data taken from the author of the question, it can be concluded that the <code>CLT</code> direction was one of the five most desirable directions from January to October. Meanwhile, only July, September and October are correct. <h3>Defending your own solution</h3> Although I haven't run any performance tests yet, I would like to point out that even the fastest solution will be of no use if I return incorrect results. Now a little bit of defense of your own solution. I know, I know, it sounds very immodest. First of all, I got everything I needed in three simple and clear steps with one uncomplicated mutation. Second, in the whole process, I didn't need any intermediate tables. Thirdly, I kept the original form of the data, supplementing it only with the carrierToToDest variable, which means a flight to one of the top 5 directions, which will greatly facilitate subsequent filtering and further work on this data. So let me remind you what needs to be done and re-assemble all the code we need below. <pre class="prettyprint"><code>library(nycflights13) library(tidyverse) ftopDest = function(data, ntop){ data %>% group_by(dest) %>% summarise(ndest = n()) %>% arrange(desc(ndest)) %>% pull(dest) %>% .[1:ntop] } carrierToTopDest = function(data, topDest){ data %>% mutate(carrierToToDest = dest %in% topDest) } df = flights %>% nest_by(year, month) %>% #Step 1 mutate(topDest = list(ftopDest(data, 5)), #Step 2 data = list(carrierToTopDest(data, topDest))) #Step 3 </code></pre> I will also remind you how to receive the most popular destinations in individual months. <pre class="prettyprint"><code>df %>% mutate(topDest = paste(topDest, collapse = " ")) %>% select(topDest) </code></pre> output <pre class="prettyprint"><code># A tibble: 12 x 3 # Rowwise: year, month year month topDest <int> <int> <chr> 1 2013 1 ATL ORD BOS MCO FLL 2 2013 2 ATL ORD BOS MCO FLL 3 2013 3 ATL ORD BOS MCO FLL 4 2013 4 ATL ORD LAX BOS MCO 5 2013 5 ORD ATL LAX BOS SFO 6 2013 6 ORD ATL LAX BOS SFO 7 2013 7 ORD ATL LAX BOS CLT 8 2013 8 ORD ATL LAX BOS SFO 9 2013 9 ORD LAX ATL BOS CLT 10 2013 10 ORD ATL LAX BOS CLT 11 2013 11 ATL ORD LAX BOS CLT 12 2013 12 ATL LAX MCO ORD CLT </code></pre> In turn, data recovery in its original form (along with the new variable <code>carrierToToDest</code>) can be obtained in this way <pre class="prettyprint"><code>df %>% select(-topDest) %>% unnest(data) </code></pre> output <pre class="prettyprint"><code># A tibble: 336,776 x 20 # Groups: year, month [12] year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA # ... with 336,766 more rows, and 7 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, # minute <dbl>, time_hour <dttm>, carrierToToDest <lgl> </code></pre> <h3>Data as expected by @Emman</h3> However, if I would like to present this data in a form similar to that expected by @Emman, I can always do it like this. <pre class="prettyprint"><code>df %>% select(-topDest) %>% unnest(data) %>% filter(carrierToToDest) %>% group_by(carrier, month, dest) %>% summarise(v= T, .groups="drop") %>% pivot_wider(names_from = dest, values_from = v) </code></pre> output <pre class="prettyprint"><code># A tibble: 125 x 10 carrier month ATL BOS ORD CLT FLL MCO LAX SFO <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> 1 9E 1 TRUE TRUE TRUE NA NA NA NA NA 2 9E 2 TRUE TRUE TRUE NA NA NA NA NA 3 9E 3 TRUE TRUE TRUE NA NA NA NA NA 4 9E 4 NA TRUE TRUE NA NA NA NA NA 5 9E 5 TRUE TRUE TRUE NA NA NA NA NA 6 9E 6 NA TRUE TRUE NA NA NA NA NA 7 9E 7 NA TRUE TRUE TRUE NA NA NA NA 8 9E 8 NA TRUE TRUE NA NA NA NA NA 9 9E 9 NA TRUE TRUE TRUE NA NA NA NA 10 9E 10 NA TRUE TRUE TRUE NA NA NA NA # ... with 115 more rows </code></pre> The main difference is that the data is correct against the assumptions though instead of <code>FALSE</code> it has the values <code>NA</code>. Of course, nothing prevents you from adding <code>mutate_if(is.logical, ~ifelse(is.na(.x), FALSE, .x))</code> at the end, which will replace every occurrence of <code>NA</code> with <code>FALSE</code>. <h3>Additional stats</h3> The organization of data in the form I propose will also allow you to easily extract additional statistics and various useful information. For example, if you are interested in which carrier is carrying the most flights to the most popular destinations, you can do this: <pre class="prettyprint"><code>df %>% select(-topDest) %>% unnest(data) %>% group_by(carrier, carrierToToDest) %>% summarise(n = n(), .groups="drop") %>% pivot_wider(names_from = carrierToToDest, values_from = n) %>% mutate(prop = `TRUE`/`FALSE`)%>% arrange(desc(prop)) </code></pre> output <pre class="prettyprint"><code># A tibble: 16 x 4 carrier `FALSE` `TRUE` prop <chr> <int> <int> <dbl> 1 FL 923 2337 2.53 2 VX 2387 2775 1.16 3 US 12866 7670 0.596 4 DL 31978 16132 0.504 5 AA 21793 10936 0.502 6 UA 39719 18946 0.477 7 YV 434 167 0.385 8 B6 43170 11465 0.266 9 MQ 21146 5251 0.248 10 9E 16464 1996 0.121 11 EV 50967 3206 0.0629 12 OO 31 1 0.0323 13 WN 12216 59 0.00483 14 AS 714 NA NA 15 F9 685 NA NA 16 HA 342 NA NA </code></pre> As you can see on an annual basis, <code>FL</code> had the most monthly flights to the most popular destinations. On the other hand, <code>AS</code>,<code> F9</code> and <code>HA</code> never made such flights. But maybe you are interested in it on a monthly basis. Nothing simpler. Just do this: <pre class="prettyprint"><code>df %>% select(-topDest) %>% unnest(data) %>% group_by(month, carrier, carrierToToDest) %>% summarise(n = n(), .groups="drop") %>% pivot_wider(names_from = carrierToToDest, values_from = n) %>% mutate(prop = `TRUE`/`FALSE`) %>% arrange(desc(prop)) </code></pre> output <pre class="prettyprint"><code># A tibble: 185 x 5 month carrier `FALSE` `TRUE` prop <int> <chr> <int> <int> <dbl> 1 5 VX 31 465 15 2 6 VX 30 450 15 3 8 VX 31 458 14.8 4 9 YV 9 33 3.67 5 10 FL 58 178 3.07 6 5 FL 85 240 2.82 7 4 FL 82 229 2.79 8 3 FL 85 231 2.72 9 2 FL 80 216 2.7 10 1 FL 89 239 2.69 # ... with 175 more rows </code></pre> As you can see here the winner is <code>VX</code>, which in May, June and August made a flight 15 times more often to the top 5 place than to other places. <h3>Performance tests</h3> Forgive me for not doing a performance test yet. Maybe soon. However, to all those who would like to make comparisons, please take into account two very important facts. First, I kept the dataframe in its original form. Secondly, I make the determination of the most popular directions, as it were, inside my calculations. Please include this in your possible performance tests. <h3>Final apology</h3> Of course, I think I can be wrong somewhere. Maybe I misread the author of the question? English is not my mother tongue so I could have made a mistake reading the assumptions. However, I don't know where the bug is or why our results differ.

Does this do what you want? As far as I can tell it matches your output but has more rows because it includes all months for all carriers; <code>carrier</code> "OO" only has flights in 5 months and your version only shows those 5 months in the summary. With the data as provided (336k rows), this takes a similar amount of time as your function, but it's faster as you deal with larger data. When I run these on data 100x as big after setting <code>my_flights_raw <- my_flights_raw %>% tidyr::uncount(100)</code>, to make it 33M rows, the code below is about 40% faster. Given the large number of groups you're dealing with, I expect this is a situation where <code>data.table</code> will really shine with better performance. <pre class="prettyprint"><code>library(tidyverse) my_flights_raw %>% count(carrier, month, dest) %>% complete(carrier, month, dest) %>% filter(dest %in% my_flights_top_dest_across_months) %>% mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>% pivot_wider(names_from = dest, values_from = n) </code></pre>

I took a stub myself, using functions from the <code>collapse</code> package. <pre class="prettyprint lang-r prettyprint-override"><code>library(magrittr) library(collapse) library(data.table) my_flights_raw %>% collapse::funique() %>% collapse::fgroup_by(carrier, month) %>% collapse::fsummarise(nested_dest = list(dest)) %>% collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>% collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>% setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>% collapse::qTBL() </code></pre> Unsurprisingly, <code>collapse</code> gives the fastest execution time. But I was surprised to see that @ThomasIsCoding's solution based on <code>data.table</code> was slower than my original <code>tidyverse</code> mix-and-match solution. I also factored in the single <code>data.table</code> dependency in Thomas's answer, compared to the variety of dependencies in my original method. <pre class="prettyprint lang-r prettyprint-override"><code>library(nycflights13) library(dplyr, warn.conflicts = FALSE) # OP original my_flights_raw <- flights %>% select(carrier, month, dest) my_flights_agg <- my_flights_raw %>% count(month, dest, name = "n_obs") %>% arrange(month, desc(n_obs)) my_flights_top_dest_across_months <- my_flights_agg %>% group_by(month) %>% slice_max(order_by = n_obs, n = 5) %>% pull(dest) %>% unique() my_flights_top_5_by_month <- my_flights_agg %>% group_by(month) %>% slice_max(order_by = n_obs, n = 5) my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest) op_slow <- function() { library(tidyr) library(tibble) library(purrr) my_flights_raw %>% group_by(carrier, month) %>% summarise(destinations_vec = list(unique(dest))) %>% add_column(top_dest = list(my_flights_top_dest_across_month)) %>% mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>% mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>% tidyr::unnest_wider(are_top_dest_included) } # OP collapse op_collapse <- function() { library(magrittr) library(collapse) library(data.table) my_flights_raw %>% collapse::funique() %>% collapse::fgroup_by(carrier, month) %>% collapse::fsummarise(nested_dest = list(dest)) %>% collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>% collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>% setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>% collapse::qTBL() } # Thomas data.table thomas_data.table <- function() { library(data.table) my_flights_top_dest_across_months <- data.table( dest = unique(my_flights_top_5_by_month$dest), fd = 1 ) dcast(my_flights_top_dest_across_months[ setDT(my_flights_raw), on = .(dest) ], carrier + month ~ dest, fun.aggregate = function(x) sum(x) > 0 )[, c( "carrier", "month", my_flights_top_dest_across_months$dest ), with = FALSE] } output_op_slow <- op_slow() output_op_collapse <- op_collapse() output_thomas <- thomas_data.table() #> Using 'month' as value column. Use 'value.var' to override waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE) #> v No differences waldo::compare(output_op_slow, as_tibble(output_thomas), ignore_attr = TRUE) #> v No differences bm <- bench::mark(op_slow = op_slow(), op_collapse = op_collapse(), thomas_dt = thomas_data.table(), check = FALSE, iterations = 100) ggplot2::autoplot(bm) </code></pre> <img src="https://i.imgur.com/ERTbstN.png" alt="">

How to summarize data by-group, by creating dummy variables as the collapsing method

Tags:

performance

r

dummy-variable

I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values.

As an example, let's take flights data from nycflights13.

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_raw
#> # A tibble: 336,776 x 3
#>    carrier month dest 
#>    <chr>   <int> <chr>
#>  1 UA          1 IAH  
#>  2 UA          1 IAH  
#>  3 AA          1 MIA  
#>  4 B6          1 BQN  
#>  5 DL          1 ATL  
#>  6 UA          1 ORD  
#>  7 B6          1 FLL  
#>  8 EV          1 IAD  
#>  9 B6          1 MCO  
#> 10 AA          1 ORD  
#> # ... with 336,766 more rows

My end-goal: I'm interested to know about each carrier in each month: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest values in each month, then intersecting all months' top-5s.

step 1
I start by simple aggregation by months:

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_agg
#> # A tibble: 1,113 x 3
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     1 LAX    1159
#>  7     1 CLT    1058
#>  8     1 MIA     981
#>  9     1 SFO     889
#> 10     1 DCA     865
#> # ... with 1,103 more rows

step 2
And now I'm going to cut the data to keep only the top 5 most popular per month.

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups:   month [12]
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     2 ATL    1267
#>  7     2 ORD    1197
#>  8     2 BOS    1182
#>  9     2 MCO    1110
#> 10     2 FLL    1073
#> # ... with 50 more rows

step 3
Now simply get the unique() of my_flights_top_5_by_month$dest:

my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)

## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"

Here's my question: given my_flights_top_dest_across_months, how can I summarize my_flights_raw to distinct carrier & month, such that the collapsing principle is whether each combination of carrier & month had flawn to each of the dest values in my_flights_top_dest_across_months?

desired output

##    carrier month ATL   ORD   BOS   MCO   FLL   LAX   SFO   CLT  
##    <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 9E          1 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  2 9E          2 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  3 9E          3 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  4 9E          4 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  5 9E          5 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6 9E          6 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  7 9E          7 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  8 9E          8 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  9 9E          9 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## 10 9E         10 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## # ... with 175 more rows

I currently have the following code that is simply inefficient. It works fine for the example flights data, but is taking forever when applied on a large dataset (with several millions rows and groups). Any idea how the task described above can be done more efficiently?

# too slow :(
op_slow_output <- 
  my_flights_raw %>%
  group_by(carrier, month) %>%
  summarise(destinations_vec = list(unique(dest))) %>%
  add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
  mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
  mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month))  )) %>%
  tidyr::unnest_wider(are_top_dest_included)

782

asked Nov 04 '21 22:11

Emman

Video Answer

4 Answers

It is quite possible that using the data.table library will be faster here. I will not argue. But I have mastered dplyr and would like to offer a pretty cool solution using the functions from this particular library.

First, let's prepare two little auxiliary functions. We will see how they work later.

library(nycflights13)
library(tidyverse)


ftopDest = function(data, ntop){
  data %>% 
    group_by(dest) %>% 
    summarise(ndest = n()) %>% 
    arrange(desc(ndest)) %>% 
    pull(dest) %>% .[1:ntop]
}

carrierToTopDest = function(data, topDest){
  data %>% mutate(carrierToToDest = dest %in% topDest)
}

Now you only need one simple mutation!

df = flights %>% nest_by(year, month) %>%  #Step 1
  mutate(topDest = list(ftopDest(data, 5)),  #Step 2
         data = list(carrierToTopDest(data, topDest)))  #Step 3

But let me describe step by step what is happening here.

In step one, let's nest the data into an internal tibble named data.

Output after Step 1

# A tibble: 12 x 3
# Rowwise:  year, month
    year month                data
   <int> <int> <list<tibble[,17]>>
 1  2013     1       [27,004 x 17]
 2  2013     2       [24,951 x 17]
 3  2013     3       [28,834 x 17]
 4  2013     4       [28,330 x 17]
 5  2013     5       [28,796 x 17]
 6  2013     6       [28,243 x 17]
 7  2013     7       [29,425 x 17]
 8  2013     8       [29,327 x 17]
 9  2013     9       [27,574 x 17]
10  2013    10       [28,889 x 17]
11  2013    11       [27,268 x 17]
12  2013    12       [28,135 x 17]

In step 2, we add the most popular flight destinations.

Output after step 2

# A tibble: 12 x 4
# Rowwise:  year, month
    year month                data topDest  
   <int> <int> <list<tibble[,17]>> <list>   
 1  2013     1       [27,004 x 17] <chr [5]>
 2  2013     2       [24,951 x 17] <chr [5]>
 3  2013     3       [28,834 x 17] <chr [5]>
 4  2013     4       [28,330 x 17] <chr [5]>
 5  2013     5       [28,796 x 17] <chr [5]>
 6  2013     6       [28,243 x 17] <chr [5]>
 7  2013     7       [29,425 x 17] <chr [5]>
 8  2013     8       [29,327 x 17] <chr [5]>
 9  2013     9       [27,574 x 17] <chr [5]>
10  2013    10       [28,889 x 17] <chr [5]>
11  2013    11       [27,268 x 17] <chr [5]>
12  2013    12       [28,135 x 17] <chr [5]>

In the last step, we add the carrierToToDest variable to the data variable, which determines whether the flight was going to one of the ntop places from the given month.

Output after step 3

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest  
   <int> <int> <list>                 <list>   
 1  2013     1 <tibble [27,004 x 18]> <chr [5]>
 2  2013     2 <tibble [24,951 x 18]> <chr [5]>
 3  2013     3 <tibble [28,834 x 18]> <chr [5]>
 4  2013     4 <tibble [28,330 x 18]> <chr [5]>
 5  2013     5 <tibble [28,796 x 18]> <chr [5]>
 6  2013     6 <tibble [28,243 x 18]> <chr [5]>
 7  2013     7 <tibble [29,425 x 18]> <chr [5]>
 8  2013     8 <tibble [29,327 x 18]> <chr [5]>
 9  2013     9 <tibble [27,574 x 18]> <chr [5]>
10  2013    10 <tibble [28,889 x 18]> <chr [5]>
11  2013    11 <tibble [27,268 x 18]> <chr [5]>
12  2013    12 <tibble [28,135 x 18]> <chr [5]>

How now we can see the most popular places. Let's do this:

df %>% mutate(topDest = paste(topDest, collapse = " "))

output

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest            
   <int> <int> <list>                 <chr>              
 1  2013     1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
 2  2013     2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
 3  2013     3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
 4  2013     4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
 5  2013     5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
 6  2013     6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
 7  2013     7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
 8  2013     8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
 9  2013     9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10  2013    10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11  2013    11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12  2013    12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT

Can we see flights to these destinations? Of course, it's not difficult.

df %>% select(-topDest) %>% 
  unnest(data) %>% 
  filter(carrierToToDest) %>% 
  select(year, month, flight, carrier, dest)

Output

# A tibble: 80,941 x 5
# Groups:   year, month [12]
    year month flight carrier dest 
   <int> <int>  <int> <chr>   <chr>
 1  2013     1    461 DL      ATL  
 2  2013     1   1696 UA      ORD  
 3  2013     1    507 B6      FLL  
 4  2013     1     79 B6      MCO  
 5  2013     1    301 AA      ORD  
 6  2013     1   1806 B6      BOS  
 7  2013     1    371 B6      FLL  
 8  2013     1   4650 MQ      ATL  
 9  2013     1   1743 DL      ATL  
10  2013     1   3768 MQ      ORD  
# ... with 80,931 more rows

This is my recipe. Very simple and transparent in my opinion. I would be extremely obligated if you would try it on your data and let me know with efficiency.

Small update

I just noticed that I wanted to group not only after year (although you don't mention it, it must be so), month, but also by the carrier variable. So let's add it as another grouping variable.

df = flights %>% nest_by(year, month, carrier) %>%  
  mutate(topDest = list(ftopDest(data, 5)),  
         data = list(carrierToTopDest(data, topDest)))

output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest  
   <int> <int> <chr>   <list>                <list>   
 1  2013     1 9E      <tibble [1,573 x 17]> <chr [5]>
 2  2013     1 AA      <tibble [2,794 x 17]> <chr [5]>
 3  2013     1 AS      <tibble [62 x 17]>    <chr [5]>
 4  2013     1 B6      <tibble [4,427 x 17]> <chr [5]>
 5  2013     1 DL      <tibble [3,690 x 17]> <chr [5]>
 6  2013     1 EV      <tibble [4,171 x 17]> <chr [5]>
 7  2013     1 F9      <tibble [59 x 17]>    <chr [5]>
 8  2013     1 FL      <tibble [328 x 17]>   <chr [5]>
 9  2013     1 HA      <tibble [31 x 17]>    <chr [5]>
10  2013     1 MQ      <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows

Now let's get acquainted with the new top 5 directions.

df %>% mutate(topDest = paste(topDest, collapse = " "))

output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest            
   <int> <int> <chr>   <list>                <chr>              
 1  2013     1 9E      <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
 2  2013     1 AA      <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
 3  2013     1 AS      <tibble [62 x 17]>    SEA NA NA NA NA    
 4  2013     1 B6      <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
 5  2013     1 DL      <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
 6  2013     1 EV      <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
 7  2013     1 F9      <tibble [59 x 17]>    DEN NA NA NA NA    
 8  2013     1 FL      <tibble [328 x 17]>   ATL CAK MKE NA NA  
 9  2013     1 HA      <tibble [31 x 17]>    HNL NA NA NA NA    
10  2013     1 MQ      <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
# ... with 175 more rows

Summing up, I would like to add that the form is very clear for me. I can see the most popular df%>% mutate (topDest = paste (topDest, collapse =" ")) directions. I can filter all flights to the most popular destinations df%>% select (-topDest)%>% unnest (data)%>% filter (carrierToToDest)%>% select (year, month, flight, carrier, dest) and do any other transformations. I do not think that presenting the same information wider on over 100 variables is convenient for any analysis.

However, if you really need wider form, let me know. We'll do it this way.

A big update for anyone interested

Results not as expected!

Dear colleagues, you got caught up in some wrong path and you missed the fact that you are getting the wrong data when you are excited to find the most effective solutions!

@Emman posted a clear assignment which was as follows I'm interested to know about each carrier in each month: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest values in each month, then intersecting all months' top-5s.

Solving it in my way, I will get the following most popular destinations in individual months:

df %>% mutate(topDest = paste(topDest, collapse = " ")) %>% 
  select(topDest)

output

# A tibble: 12 x 3
# Rowwise:  year, month
    year month topDest            
   <int> <int> <chr>              
 1  2013     1 ATL ORD BOS MCO FLL
 2  2013     2 ATL ORD BOS MCO FLL
 3  2013     3 ATL ORD BOS MCO FLL
 4  2013     4 ATL ORD LAX BOS MCO
 5  2013     5 ORD ATL LAX BOS SFO
 6  2013     6 ORD ATL LAX BOS SFO
 7  2013     7 ORD ATL LAX BOS CLT
 8  2013     8 ORD ATL LAX BOS SFO
 9  2013     9 ORD LAX ATL BOS CLT
10  2013    10 ORD ATL LAX BOS CLT
11  2013    11 ATL ORD LAX BOS CLT
12  2013    12 ATL LAX MCO ORD CLT

Let's check if I made a mistake by accident. Let's do a test for three sample months.

flights %>%
  filter(month==1) %>% 
  group_by(dest) %>%
  summarise(ndest = n()) %>%
  arrange(desc(ndest)) %>%
  pull(dest) %>% .[1:5]
#[1] "ATL" "ORD" "BOS" "MCO" "FLL"

flights %>%
  filter(month==6) %>% 
  group_by(dest) %>%
  summarise(ndest = n()) %>%
  arrange(desc(ndest)) %>%
  pull(dest) %>% .[1:5]
#[1] "ORD" "ATL" "LAX" "BOS" "SFO"

flights %>%
  filter(month==10) %>% 
  group_by(dest) %>%
  summarise(ndest = n()) %>%
  arrange(desc(ndest)) %>%
  pull(dest) %>% .[1:5]
#[1] "ORD" "ATL" "LAX" "BOS" "CLT"

Well, it's probably hard to deny that my results dont differ from those of the conclusive test.

It is also very clear that neither in January nor in February the direction CLT was not one of the 5 most popular destinations !!

However, if we compare it with the result expected and given by @Emman, I have to conclude that this expectation is inconsistent with the initial assumption!

##    carrier month ATL   ORD   BOS   MCO   FLL   LAX   SFO   CLT  
##    <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 9E          1 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  2 9E          2 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  3 9E          3 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  4 9E          4 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  5 9E          5 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6 9E          6 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  7 9E          7 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  8 9E          8 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  9 9E          9 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## 10 9E         10 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## # ... with 175 more rows

From the above data taken from the author of the question, it can be concluded that the CLT direction was one of the five most desirable directions from January to October. Meanwhile, only July, September and October are correct.

Defending your own solution

Although I haven't run any performance tests yet, I would like to point out that even the fastest solution will be of no use if I return incorrect results.

Now a little bit of defense of your own solution. I know, I know, it sounds very immodest.

First of all, I got everything I needed in three simple and clear steps with one uncomplicated mutation.

Second, in the whole process, I didn't need any intermediate tables.

Thirdly, I kept the original form of the data, supplementing it only with the carrierToToDest variable, which means a flight to one of the top 5 directions, which will greatly facilitate subsequent filtering and further work on this data.

So let me remind you what needs to be done and re-assemble all the code we need below.

library(nycflights13)
library(tidyverse)


ftopDest = function(data, ntop){
  data %>%
    group_by(dest) %>%
    summarise(ndest = n()) %>%
    arrange(desc(ndest)) %>%
    pull(dest) %>% .[1:ntop]
}

carrierToTopDest = function(data, topDest){
  data %>% mutate(carrierToToDest = dest %in% topDest)
}

df = flights %>% nest_by(year, month) %>%  #Step 1
  mutate(topDest = list(ftopDest(data, 5)),  #Step 2
         data = list(carrierToTopDest(data, topDest)))  #Step 3

I will also remind you how to receive the most popular destinations in individual months.

df %>% mutate(topDest = paste(topDest, collapse = " ")) %>% 
  select(topDest)

output

# A tibble: 12 x 3
# Rowwise:  year, month
    year month topDest            
   <int> <int> <chr>              
 1  2013     1 ATL ORD BOS MCO FLL
 2  2013     2 ATL ORD BOS MCO FLL
 3  2013     3 ATL ORD BOS MCO FLL
 4  2013     4 ATL ORD LAX BOS MCO
 5  2013     5 ORD ATL LAX BOS SFO
 6  2013     6 ORD ATL LAX BOS SFO
 7  2013     7 ORD ATL LAX BOS CLT
 8  2013     8 ORD ATL LAX BOS SFO
 9  2013     9 ORD LAX ATL BOS CLT
10  2013    10 ORD ATL LAX BOS CLT
11  2013    11 ATL ORD LAX BOS CLT
12  2013    12 ATL LAX MCO ORD CLT

In turn, data recovery in its original form (along with the new variable carrierToToDest) can be obtained in this way

df %>% select(-topDest) %>% unnest(data)

output

# A tibble: 336,776 x 20
# Groups:   year, month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr> 
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR   
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA   
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK   
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK   
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA   
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR   
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR   
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA   
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK   
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA   
# ... with 336,766 more rows, and 7 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, carrierToToDest <lgl>

Data as expected by @Emman

However, if I would like to present this data in a form similar to that expected by @Emman, I can always do it like this.

df %>% select(-topDest) %>%
  unnest(data) %>%
  filter(carrierToToDest) %>%
  group_by(carrier, month, dest) %>% 
  summarise(v= T, .groups="drop") %>% 
  pivot_wider(names_from = dest, values_from =  v)

output

# A tibble: 125 x 10
   carrier month ATL   BOS   ORD   CLT   FLL   MCO   LAX   SFO  
   <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 9E          1 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 2 9E          2 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 3 9E          3 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 4 9E          4 NA    TRUE  TRUE  NA    NA    NA    NA    NA   
 5 9E          5 TRUE  TRUE  TRUE  NA    NA    NA    NA    NA   
 6 9E          6 NA    TRUE  TRUE  NA    NA    NA    NA    NA   
 7 9E          7 NA    TRUE  TRUE  TRUE  NA    NA    NA    NA   
 8 9E          8 NA    TRUE  TRUE  NA    NA    NA    NA    NA   
 9 9E          9 NA    TRUE  TRUE  TRUE  NA    NA    NA    NA   
10 9E         10 NA    TRUE  TRUE  TRUE  NA    NA    NA    NA   
# ... with 115 more rows

The main difference is that the data is correct against the assumptions though instead of FALSE it has the values NA.

Of course, nothing prevents you from adding mutate_if(is.logical, ~ifelse(is.na(.x), FALSE, .x)) at the end, which will replace every occurrence of NA with FALSE.

Additional stats

The organization of data in the form I propose will also allow you to easily extract additional statistics and various useful information. For example, if you are interested in which carrier is carrying the most flights to the most popular destinations, you can do this:

df %>% select(-topDest) %>%
  unnest(data) %>% 
  group_by(carrier, carrierToToDest) %>% 
  summarise(n = n(), .groups="drop") %>% 
  pivot_wider(names_from = carrierToToDest, values_from = n) %>% 
  mutate(prop = `TRUE`/`FALSE`)%>% 
  arrange(desc(prop))

output

# A tibble: 16 x 4
   carrier `FALSE` `TRUE`     prop
   <chr>     <int>  <int>    <dbl>
 1 FL          923   2337  2.53   
 2 VX         2387   2775  1.16   
 3 US        12866   7670  0.596  
 4 DL        31978  16132  0.504  
 5 AA        21793  10936  0.502  
 6 UA        39719  18946  0.477  
 7 YV          434    167  0.385  
 8 B6        43170  11465  0.266  
 9 MQ        21146   5251  0.248  
10 9E        16464   1996  0.121  
11 EV        50967   3206  0.0629 
12 OO           31      1  0.0323 
13 WN        12216     59  0.00483
14 AS          714     NA NA      
15 F9          685     NA NA      
16 HA          342     NA NA

As you can see on an annual basis, FL had the most monthly flights to the most popular destinations. On the other hand, AS, F9 and HA never made such flights.

But maybe you are interested in it on a monthly basis. Nothing simpler. Just do this:

df %>% select(-topDest) %>%
  unnest(data) %>% 
  group_by(month, carrier, carrierToToDest) %>% 
  summarise(n = n(), .groups="drop") %>% 
  pivot_wider(names_from = carrierToToDest, values_from = n) %>% 
  mutate(prop = `TRUE`/`FALSE`) %>% 
  arrange(desc(prop))

output

# A tibble: 185 x 5
   month carrier `FALSE` `TRUE`  prop
   <int> <chr>     <int>  <int> <dbl>
 1     5 VX           31    465 15   
 2     6 VX           30    450 15   
 3     8 VX           31    458 14.8 
 4     9 YV            9     33  3.67
 5    10 FL           58    178  3.07
 6     5 FL           85    240  2.82
 7     4 FL           82    229  2.79
 8     3 FL           85    231  2.72
 9     2 FL           80    216  2.7 
10     1 FL           89    239  2.69
# ... with 175 more rows

As you can see here the winner is VX, which in May, June and August made a flight 15 times more often to the top 5 place than to other places.

Performance tests

Forgive me for not doing a performance test yet. Maybe soon. However, to all those who would like to make comparisons, please take into account two very important facts. First, I kept the dataframe in its original form. Secondly, I make the determination of the most popular directions, as it were, inside my calculations. Please include this in your possible performance tests.

Final apology

Of course, I think I can be wrong somewhere. Maybe I misread the author of the question? English is not my mother tongue so I could have made a mistake reading the assumptions. However, I don't know where the bug is or why our results differ.

183

answered Oct 24 '22 04:10

Marek Fiołka

Update

I improved my data.table solution with the following one

thomas_data.table2 <- function() {
  library(data.table)
  dcast(
    data.table(dest = my_flights_top_dest_across_months)[
      unique(setDT(my_flights_raw)),
      on = .(dest)
    ],
    carrier + month ~ dest
  )[
    ,
    .(carrier, month, .SD[, my_flights_top_dest_across_months, with = FALSE] > 0)
  ]
}

and the benchmark is seen as below

enter image description here

Here is the benchmarking script:

library(nycflights13)
library(dplyr, warn.conflicts = FALSE)

# OP original
my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_agg <-
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs))

my_flights_top_dest_across_months <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5) %>%
  pull(dest) %>%
  unique()

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)


op_slow <- function() {
  library(tidyr)
  library(tibble)
  library(purrr)

  my_flights_raw %>%
    group_by(carrier, month) %>%
    summarise(destinations_vec = list(unique(dest))) %>%
    add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
    mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x), .keep = "unused") %>%
    mutate(across(are_top_dest_included, ~ purrr::map(.x = ., .f = ~ setNames(object = .x, nm = my_flights_top_dest_across_month)))) %>%
    tidyr::unnest_wider(are_top_dest_included)
}


# OP collapse
op_collapse <- function() {
  library(magrittr)
  library(collapse)
  library(data.table)

  my_flights_raw %>%
    collapse::funique() %>%
    collapse::fgroup_by(carrier, month) %>%
    collapse::fsummarise(nested_dest = list(dest)) %>%
    collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
    collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
    setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
    collapse::qTBL()
}


# Thomas data.table

thomas_data.table1 <- function() {
  library(data.table)

  my_flights_top_dest_across_months <-
    data.table(
      dest = unique(my_flights_top_5_by_month$dest),
      fd = 1
    )

  dcast(my_flights_top_dest_across_months[
    setDT(my_flights_raw),
    on = .(dest)
  ],
  carrier + month ~ dest,
  fun.aggregate = function(x) sum(x) > 0
  )[, c(
    "carrier", "month",
    my_flights_top_dest_across_months$dest
  ), with = FALSE]
}

thomas_data.table2 <- function() {
  library(data.table)
  dcast(
    data.table(dest = my_flights_top_dest_across_months)[
      unique(setDT(my_flights_raw)),
      on = .(dest)
    ],
    carrier + month ~ dest
  )[
    ,
    .(carrier, month, .SD[, my_flights_top_dest_across_months, with = FALSE] > 0)
  ]
}

# output_op_slow <- op_slow()
# output_op_collapse <- op_collapse()
# output_thomas1 <- thomas_data.table1()
# output_thomas2 <- thomas_data.table2()
# #> Using 'month' as value column. Use 'value.var' to override

# waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
# #> v No differences
# waldo::compare(output_op_slow, as_tibble(output_thomas1), ignore_attr = TRUE)
# #> v No differences

bm <- bench::mark(
  op_slow = op_slow(),
  op_collapse = op_collapse(),
  thomas_dt1 = thomas_data.table1(),
  thomas_dt2 = thomas_data.table2(),
  check = FALSE,
  iterations = 100L
)

ggplot2::autoplot(bm)

Previous Answer

Given my_flights_top_5_by_month and my_flights_raw, we can try the following data.table approach

library(data.table)

my_flights_top_dest_across_months <- data.table(
  dest = unique(my_flights_top_5_by_month$dest),
  fd = 1
)
dcast(my_flights_top_dest_across_months[
  setDT(my_flights_raw),
  on = .(dest)
],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0
)[, c(
  "carrier", "month",
  my_flights_top_dest_across_months$dest
), with = FALSE]

which gives

     carrier month   ATL   ORD   BOS   MCO   FLL   LAX   SFO  CLT
  1:      9E     1  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  2:      9E     2  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  3:      9E     3  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  4:      9E     4 FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
  5:      9E     5  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE TRUE
 ---
181:      YV     8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
182:      YV     9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
183:      YV    10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
184:      YV    11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
185:      YV    12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

answered Oct 24 '22 05:10

ThomasIsCoding

Does this do what you want? As far as I can tell it matches your output but has more rows because it includes all months for all carriers; carrier "OO" only has flights in 5 months and your version only shows those 5 months in the summary.

With the data as provided (336k rows), this takes a similar amount of time as your function, but it's faster as you deal with larger data. When I run these on data 100x as big after setting my_flights_raw <- my_flights_raw %>% tidyr::uncount(100), to make it 33M rows, the code below is about 40% faster.

Given the large number of groups you're dealing with, I expect this is a situation where data.table will really shine with better performance.

library(tidyverse)
my_flights_raw %>%
  count(carrier, month, dest) %>%
  complete(carrier, month, dest) %>%
  filter(dest %in% my_flights_top_dest_across_months) %>%
  mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>%
  pivot_wider(names_from = dest, values_from = n)

answered Oct 24 '22 04:10

Jon Spring

I took a stub myself, using functions from the collapse package.

library(magrittr)
library(collapse)
library(data.table)
  
my_flights_raw %>%
  collapse::funique() %>%
  collapse::fgroup_by(carrier, month) %>%
  collapse::fsummarise(nested_dest = list(dest)) %>%
  collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
  collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
  setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
  collapse::qTBL()

Unsurprisingly, collapse gives the fastest execution time. But I was surprised to see that @ThomasIsCoding's solution based on data.table was slower than my original tidyverse mix-and-match solution.

I also factored in the single data.table dependency in Thomas's answer, compared to the variety of dependencies in my original method.

library(nycflights13)
library(dplyr, warn.conflicts = FALSE)

# OP original
my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_top_dest_across_months <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5) %>%
  pull(dest) %>%
  unique()

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)


op_slow <- function() {
  library(tidyr)
  library(tibble)
  library(purrr)
  
  my_flights_raw %>%
    group_by(carrier, month) %>%
    summarise(destinations_vec = list(unique(dest))) %>%
    add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
    mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
    mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month))  )) %>%
    tidyr::unnest_wider(are_top_dest_included)
}  


# OP collapse
op_collapse <- function() {
  library(magrittr)
  library(collapse)
  library(data.table)
  
  my_flights_raw %>%
    collapse::funique() %>%
    collapse::fgroup_by(carrier, month) %>%
    collapse::fsummarise(nested_dest = list(dest)) %>%
    collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
    collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
    setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
    collapse::qTBL()
}
  

# Thomas data.table
thomas_data.table <- function() {
  library(data.table)
  
  my_flights_top_dest_across_months <- 
    data.table(
      dest = unique(my_flights_top_5_by_month$dest),
      fd = 1
    )
  
  dcast(my_flights_top_dest_across_months[
    setDT(my_flights_raw),
    on = .(dest)
  ],
  carrier + month ~ dest,
  fun.aggregate = function(x) sum(x) > 0
  )[, c(
    "carrier", "month",
    my_flights_top_dest_across_months$dest
  ), with = FALSE]
}

output_op_slow <- op_slow()
output_op_collapse <- op_collapse()
output_thomas <- thomas_data.table()
#> Using 'month' as value column. Use 'value.var' to override

waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
#> v No differences
waldo::compare(output_op_slow, as_tibble(output_thomas), ignore_attr = TRUE) 
#> v No differences

bm <- bench::mark(op_slow = op_slow(),
            op_collapse = op_collapse(),
            thomas_dt = thomas_data.table(),
            check = FALSE,
            iterations = 100)

ggplot2::autoplot(bm)

answered Oct 24 '22 05:10

Emman

Related questions
                            
                                Printing p-values with <0.001
                            
                                data.table function works in script but not in package
                            
                                How to build a pdf vignette in R and RStudio
                            
                                How to get line breaks in equation when knitting to pdf?
                            
                                size legend for plotly bubble map/chart
                            
                                Bookdown: Fix extra space before Chinese string inside R code chunk
                            
                                How does doRedis work?
                            
                                R / Sweave formatting numbers with \Sexpr{} in scientific notation
                            
                                how to skip through a loop when debugging R code
                            
                                ggplot legend showing transparency and fill color
                            
                                Advanced error handling
                            
                                How does glmnet's standardize argument handle dummy variables?
                            
                                Shiny: download zip archive
                            
                                How to change the height of select dropdown in shiny?
                            
                                Add beamer frame options in knitr/rmarkdown
                            
                                Add Polygons to R shiny leaflet map
                            
                                Consistent color scale and legend between plots when not all levels of a grouping variable are present in the data
                            
                                Getting separate axis labels on R plotly subplots
                            
                                rattle installation error: Invalid root element:
                            
                                Shuffle Vector in R, But Identical Elements Should Have Minimum Distance

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to summarize data by-group, by creating dummy variables as the collapsing method

Tags:

performance

r

dummy-variable

Emman

People also ask

Video Answer

4 Answers

Small update

A big update for anyone interested

Results not as expected!

Defending your own solution

Data as expected by @Emman

Additional stats

Performance tests

Final apology

Marek Fiołka

Update

Previous Answer

ThomasIsCoding

Jon Spring

Emman

Recent Activity

Donate For Us