I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values.
As an example, let's take flights
data from nycflights13
.
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
my_flights_raw <-
flights %>%
select(carrier, month, dest)
my_flights_raw
#> # A tibble: 336,776 x 3
#> carrier month dest
#> <chr> <int> <chr>
#> 1 UA 1 IAH
#> 2 UA 1 IAH
#> 3 AA 1 MIA
#> 4 B6 1 BQN
#> 5 DL 1 ATL
#> 6 UA 1 ORD
#> 7 B6 1 FLL
#> 8 EV 1 IAD
#> 9 B6 1 MCO
#> 10 AA 1 ORD
#> # ... with 336,766 more rows
My end-goal: I'm interested to know about each carrier
in each month
: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest
values in each month, then intersecting all months' top-5s.
step 1
I start by simple aggregation by months:
my_flights_agg <-
my_flights_raw %>%
count(month, dest, name = "n_obs") %>%
arrange(month, desc(n_obs))
my_flights_agg
#> # A tibble: 1,113 x 3
#> month dest n_obs
#> <int> <chr> <int>
#> 1 1 ATL 1396
#> 2 1 ORD 1269
#> 3 1 BOS 1245
#> 4 1 MCO 1175
#> 5 1 FLL 1161
#> 6 1 LAX 1159
#> 7 1 CLT 1058
#> 8 1 MIA 981
#> 9 1 SFO 889
#> 10 1 DCA 865
#> # ... with 1,103 more rows
step 2
And now I'm going to cut the data to keep only the top 5 most popular per month.
my_flights_top_5_by_month <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5)
my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups: month [12]
#> month dest n_obs
#> <int> <chr> <int>
#> 1 1 ATL 1396
#> 2 1 ORD 1269
#> 3 1 BOS 1245
#> 4 1 MCO 1175
#> 5 1 FLL 1161
#> 6 2 ATL 1267
#> 7 2 ORD 1197
#> 8 2 BOS 1182
#> 9 2 MCO 1110
#> 10 2 FLL 1073
#> # ... with 50 more rows
step 3
Now simply get the unique()
of my_flights_top_5_by_month$dest
:
my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)
## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"
Here's my question: given my_flights_top_dest_across_months
, how can I summarize my_flights_raw
to distinct carrier
& month
, such that the collapsing principle is whether each combination of carrier
& month
had flawn to each of the dest
values in my_flights_top_dest_across_months
?
desired output
## carrier month ATL ORD BOS MCO FLL LAX SFO CLT
## <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 2 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 3 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 4 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 5 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 6 9E 6 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 7 9E 7 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 8 9E 8 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 9 9E 9 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 10 9E 10 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## # ... with 175 more rows
I currently have the following code that is simply inefficient. It works fine for the example flights
data, but is taking forever when applied on a large dataset (with several millions rows and groups). Any idea how the task described above can be done more efficiently?
# too slow :(
op_slow_output <-
my_flights_raw %>%
group_by(carrier, month) %>%
summarise(destinations_vec = list(unique(dest))) %>%
add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>%
tidyr::unnest_wider(are_top_dest_included)
There are two steps to successfully set up dummy variables in a multiple regression: (1) create dummy variables that represent the categories of your categorical independent variable; and (2) enter values into these dummy variables – known as dummy coding – to represent the categories of the categorical independent ...
The basic syntax that we’ll use to group and summarize data is as follows: data %>% group_by(col_name) %>% summarize(summary_name = summary_function) Note: The functions summarize () and summarise () are equivalent. Example 1: Find Mean & Median by Group
Two of the most common tasks that you’ll perform in data analysis are grouping and summarizing data. Fortunately the dplyr package in R allows you to quickly group and summarize data. This tutorial provides a quick guide to getting started with dplyr. Before you can use the functions in the dplyr package, you must first load the package:
A summary statistic can be realized among multiple groups. group_by (yearID, teamID): Group by year and team Before you intend to do an operation, you can filter the dataset. The dataset starts in 1871, and the analysis does not need the years prior to 1980.
You should always summarize a sample of data values to make them more easily understood (by you and others). At the very least you need to show: Middle value – centrality, that is, an average. Dispersion – how spread out the data are around the average. Replication – how large the sample is.
It is quite possible that using the data.table
library will be faster here. I will not argue. But I have mastered dplyr
and would like to offer a pretty cool solution using the functions from this particular library.
First, let's prepare two little auxiliary functions. We will see how they work later.
library(nycflights13)
library(tidyverse)
ftopDest = function(data, ntop){
data %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:ntop]
}
carrierToTopDest = function(data, topDest){
data %>% mutate(carrierToToDest = dest %in% topDest)
}
Now you only need one simple mutation!
df = flights %>% nest_by(year, month) %>% #Step 1
mutate(topDest = list(ftopDest(data, 5)), #Step 2
data = list(carrierToTopDest(data, topDest))) #Step 3
But let me describe step by step what is happening here.
In step one, let's nest the data into an internal tibble
named data
.
Output after Step 1
# A tibble: 12 x 3
# Rowwise: year, month
year month data
<int> <int> <list<tibble[,17]>>
1 2013 1 [27,004 x 17]
2 2013 2 [24,951 x 17]
3 2013 3 [28,834 x 17]
4 2013 4 [28,330 x 17]
5 2013 5 [28,796 x 17]
6 2013 6 [28,243 x 17]
7 2013 7 [29,425 x 17]
8 2013 8 [29,327 x 17]
9 2013 9 [27,574 x 17]
10 2013 10 [28,889 x 17]
11 2013 11 [27,268 x 17]
12 2013 12 [28,135 x 17]
In step 2, we add the most popular flight destinations.
Output after step 2
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list<tibble[,17]>> <list>
1 2013 1 [27,004 x 17] <chr [5]>
2 2013 2 [24,951 x 17] <chr [5]>
3 2013 3 [28,834 x 17] <chr [5]>
4 2013 4 [28,330 x 17] <chr [5]>
5 2013 5 [28,796 x 17] <chr [5]>
6 2013 6 [28,243 x 17] <chr [5]>
7 2013 7 [29,425 x 17] <chr [5]>
8 2013 8 [29,327 x 17] <chr [5]>
9 2013 9 [27,574 x 17] <chr [5]>
10 2013 10 [28,889 x 17] <chr [5]>
11 2013 11 [27,268 x 17] <chr [5]>
12 2013 12 [28,135 x 17] <chr [5]>
In the last step, we add the carrierToToDest
variable to the data
variable, which determines whether the flight was going to one of the ntop
places from the given month.
Output after step 3
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list> <list>
1 2013 1 <tibble [27,004 x 18]> <chr [5]>
2 2013 2 <tibble [24,951 x 18]> <chr [5]>
3 2013 3 <tibble [28,834 x 18]> <chr [5]>
4 2013 4 <tibble [28,330 x 18]> <chr [5]>
5 2013 5 <tibble [28,796 x 18]> <chr [5]>
6 2013 6 <tibble [28,243 x 18]> <chr [5]>
7 2013 7 <tibble [29,425 x 18]> <chr [5]>
8 2013 8 <tibble [29,327 x 18]> <chr [5]>
9 2013 9 <tibble [27,574 x 18]> <chr [5]>
10 2013 10 <tibble [28,889 x 18]> <chr [5]>
11 2013 11 <tibble [27,268 x 18]> <chr [5]>
12 2013 12 <tibble [28,135 x 18]> <chr [5]>
How now we can see the most popular places. Let's do this:
df %>% mutate(topDest = paste(topDest, collapse = " "))
output
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list> <chr>
1 2013 1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
2 2013 2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
3 2013 3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
4 2013 4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
5 2013 5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
6 2013 6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
7 2013 7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
8 2013 8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
9 2013 9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10 2013 10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11 2013 11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12 2013 12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT
Can we see flights to these destinations? Of course, it's not difficult.
df %>% select(-topDest) %>%
unnest(data) %>%
filter(carrierToToDest) %>%
select(year, month, flight, carrier, dest)
Output
# A tibble: 80,941 x 5
# Groups: year, month [12]
year month flight carrier dest
<int> <int> <int> <chr> <chr>
1 2013 1 461 DL ATL
2 2013 1 1696 UA ORD
3 2013 1 507 B6 FLL
4 2013 1 79 B6 MCO
5 2013 1 301 AA ORD
6 2013 1 1806 B6 BOS
7 2013 1 371 B6 FLL
8 2013 1 4650 MQ ATL
9 2013 1 1743 DL ATL
10 2013 1 3768 MQ ORD
# ... with 80,931 more rows
This is my recipe. Very simple and transparent in my opinion. I would be extremely obligated if you would try it on your data and let me know with efficiency.
I just noticed that I wanted to group not only after year
(although you don't mention it, it must be so), month
, but also by the carrier
variable. So let's add it as another grouping variable.
df = flights %>% nest_by(year, month, carrier) %>%
mutate(topDest = list(ftopDest(data, 5)),
data = list(carrierToTopDest(data, topDest)))
output
# A tibble: 185 x 5
# Rowwise: year, month, carrier
year month carrier data topDest
<int> <int> <chr> <list> <list>
1 2013 1 9E <tibble [1,573 x 17]> <chr [5]>
2 2013 1 AA <tibble [2,794 x 17]> <chr [5]>
3 2013 1 AS <tibble [62 x 17]> <chr [5]>
4 2013 1 B6 <tibble [4,427 x 17]> <chr [5]>
5 2013 1 DL <tibble [3,690 x 17]> <chr [5]>
6 2013 1 EV <tibble [4,171 x 17]> <chr [5]>
7 2013 1 F9 <tibble [59 x 17]> <chr [5]>
8 2013 1 FL <tibble [328 x 17]> <chr [5]>
9 2013 1 HA <tibble [31 x 17]> <chr [5]>
10 2013 1 MQ <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows
Now let's get acquainted with the new top 5 directions.
df %>% mutate(topDest = paste(topDest, collapse = " "))
output
# A tibble: 185 x 5
# Rowwise: year, month, carrier
year month carrier data topDest
<int> <int> <chr> <list> <chr>
1 2013 1 9E <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
2 2013 1 AA <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
3 2013 1 AS <tibble [62 x 17]> SEA NA NA NA NA
4 2013 1 B6 <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
5 2013 1 DL <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
6 2013 1 EV <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
7 2013 1 F9 <tibble [59 x 17]> DEN NA NA NA NA
8 2013 1 FL <tibble [328 x 17]> ATL CAK MKE NA NA
9 2013 1 HA <tibble [31 x 17]> HNL NA NA NA NA
10 2013 1 MQ <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
# ... with 175 more rows
Summing up, I would like to add that the form is very clear for me. I can see the most popular df%>% mutate (topDest = paste (topDest, collapse =" "))
directions. I can filter all flights to the most popular destinations df%>% select (-topDest)%>% unnest (data)%>% filter (carrierToToDest)%>% select (year, month, flight, carrier, dest)
and do any other transformations. I do not think that presenting the same information wider on over 100 variables is convenient for any analysis.
However, if you really need wider form, let me know. We'll do it this way.
Dear colleagues, you got caught up in some wrong path and you missed the fact that you are getting the wrong data when you are excited to find the most effective solutions!
@Emman posted a clear assignment which was as follows I'm interested to know about each carrier in each month: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest values in each month, then intersecting all months' top-5s.
Solving it in my way, I will get the following most popular destinations in individual months:
df %>% mutate(topDest = paste(topDest, collapse = " ")) %>%
select(topDest)
output
# A tibble: 12 x 3
# Rowwise: year, month
year month topDest
<int> <int> <chr>
1 2013 1 ATL ORD BOS MCO FLL
2 2013 2 ATL ORD BOS MCO FLL
3 2013 3 ATL ORD BOS MCO FLL
4 2013 4 ATL ORD LAX BOS MCO
5 2013 5 ORD ATL LAX BOS SFO
6 2013 6 ORD ATL LAX BOS SFO
7 2013 7 ORD ATL LAX BOS CLT
8 2013 8 ORD ATL LAX BOS SFO
9 2013 9 ORD LAX ATL BOS CLT
10 2013 10 ORD ATL LAX BOS CLT
11 2013 11 ATL ORD LAX BOS CLT
12 2013 12 ATL LAX MCO ORD CLT
Let's check if I made a mistake by accident. Let's do a test for three sample months.
flights %>%
filter(month==1) %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:5]
#[1] "ATL" "ORD" "BOS" "MCO" "FLL"
flights %>%
filter(month==6) %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:5]
#[1] "ORD" "ATL" "LAX" "BOS" "SFO"
flights %>%
filter(month==10) %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:5]
#[1] "ORD" "ATL" "LAX" "BOS" "CLT"
Well, it's probably hard to deny that my results dont differ from those of the conclusive test.
It is also very clear that neither in January nor in February the direction CLT
was not one of the 5 most popular destinations !!
However, if we compare it with the result expected and given by @Emman, I have to conclude that this expectation is inconsistent with the initial assumption!
## carrier month ATL ORD BOS MCO FLL LAX SFO CLT
## <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 2 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 3 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 4 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 5 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 6 9E 6 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 7 9E 7 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 8 9E 8 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 9 9E 9 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 10 9E 10 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## # ... with 175 more rows
From the above data taken from the author of the question, it can be concluded that the CLT
direction was one of the five most desirable directions from January to October.
Meanwhile, only July, September and October are correct.
Although I haven't run any performance tests yet, I would like to point out that even the fastest solution will be of no use if I return incorrect results.
Now a little bit of defense of your own solution. I know, I know, it sounds very immodest.
First of all, I got everything I needed in three simple and clear steps with one uncomplicated mutation.
Second, in the whole process, I didn't need any intermediate tables.
Thirdly, I kept the original form of the data, supplementing it only with the carrierToToDest variable, which means a flight to one of the top 5 directions, which will greatly facilitate subsequent filtering and further work on this data.
So let me remind you what needs to be done and re-assemble all the code we need below.
library(nycflights13)
library(tidyverse)
ftopDest = function(data, ntop){
data %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:ntop]
}
carrierToTopDest = function(data, topDest){
data %>% mutate(carrierToToDest = dest %in% topDest)
}
df = flights %>% nest_by(year, month) %>% #Step 1
mutate(topDest = list(ftopDest(data, 5)), #Step 2
data = list(carrierToTopDest(data, topDest))) #Step 3
I will also remind you how to receive the most popular destinations in individual months.
df %>% mutate(topDest = paste(topDest, collapse = " ")) %>%
select(topDest)
output
# A tibble: 12 x 3
# Rowwise: year, month
year month topDest
<int> <int> <chr>
1 2013 1 ATL ORD BOS MCO FLL
2 2013 2 ATL ORD BOS MCO FLL
3 2013 3 ATL ORD BOS MCO FLL
4 2013 4 ATL ORD LAX BOS MCO
5 2013 5 ORD ATL LAX BOS SFO
6 2013 6 ORD ATL LAX BOS SFO
7 2013 7 ORD ATL LAX BOS CLT
8 2013 8 ORD ATL LAX BOS SFO
9 2013 9 ORD LAX ATL BOS CLT
10 2013 10 ORD ATL LAX BOS CLT
11 2013 11 ATL ORD LAX BOS CLT
12 2013 12 ATL LAX MCO ORD CLT
In turn, data recovery in its original form (along with the new variable carrierToToDest
) can be obtained in this way
df %>% select(-topDest) %>% unnest(data)
output
# A tibble: 336,776 x 20
# Groups: year, month [12]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR
8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA
9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK
10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
# ... with 336,766 more rows, and 7 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, carrierToToDest <lgl>
However, if I would like to present this data in a form similar to that expected by @Emman, I can always do it like this.
df %>% select(-topDest) %>%
unnest(data) %>%
filter(carrierToToDest) %>%
group_by(carrier, month, dest) %>%
summarise(v= T, .groups="drop") %>%
pivot_wider(names_from = dest, values_from = v)
output
# A tibble: 125 x 10
carrier month ATL BOS ORD CLT FLL MCO LAX SFO
<chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 9E 1 TRUE TRUE TRUE NA NA NA NA NA
2 9E 2 TRUE TRUE TRUE NA NA NA NA NA
3 9E 3 TRUE TRUE TRUE NA NA NA NA NA
4 9E 4 NA TRUE TRUE NA NA NA NA NA
5 9E 5 TRUE TRUE TRUE NA NA NA NA NA
6 9E 6 NA TRUE TRUE NA NA NA NA NA
7 9E 7 NA TRUE TRUE TRUE NA NA NA NA
8 9E 8 NA TRUE TRUE NA NA NA NA NA
9 9E 9 NA TRUE TRUE TRUE NA NA NA NA
10 9E 10 NA TRUE TRUE TRUE NA NA NA NA
# ... with 115 more rows
The main difference is that the data is correct against the assumptions though instead of FALSE
it has the values NA
.
Of course, nothing prevents you from adding mutate_if(is.logical, ~ifelse(is.na(.x), FALSE, .x))
at the end, which will replace every occurrence of NA
with FALSE
.
The organization of data in the form I propose will also allow you to easily extract additional statistics and various useful information. For example, if you are interested in which carrier is carrying the most flights to the most popular destinations, you can do this:
df %>% select(-topDest) %>%
unnest(data) %>%
group_by(carrier, carrierToToDest) %>%
summarise(n = n(), .groups="drop") %>%
pivot_wider(names_from = carrierToToDest, values_from = n) %>%
mutate(prop = `TRUE`/`FALSE`)%>%
arrange(desc(prop))
output
# A tibble: 16 x 4
carrier `FALSE` `TRUE` prop
<chr> <int> <int> <dbl>
1 FL 923 2337 2.53
2 VX 2387 2775 1.16
3 US 12866 7670 0.596
4 DL 31978 16132 0.504
5 AA 21793 10936 0.502
6 UA 39719 18946 0.477
7 YV 434 167 0.385
8 B6 43170 11465 0.266
9 MQ 21146 5251 0.248
10 9E 16464 1996 0.121
11 EV 50967 3206 0.0629
12 OO 31 1 0.0323
13 WN 12216 59 0.00483
14 AS 714 NA NA
15 F9 685 NA NA
16 HA 342 NA NA
As you can see on an annual basis, FL
had the most monthly flights to the most popular destinations.
On the other hand, AS
, F9
and HA
never made such flights.
But maybe you are interested in it on a monthly basis. Nothing simpler. Just do this:
df %>% select(-topDest) %>%
unnest(data) %>%
group_by(month, carrier, carrierToToDest) %>%
summarise(n = n(), .groups="drop") %>%
pivot_wider(names_from = carrierToToDest, values_from = n) %>%
mutate(prop = `TRUE`/`FALSE`) %>%
arrange(desc(prop))
output
# A tibble: 185 x 5
month carrier `FALSE` `TRUE` prop
<int> <chr> <int> <int> <dbl>
1 5 VX 31 465 15
2 6 VX 30 450 15
3 8 VX 31 458 14.8
4 9 YV 9 33 3.67
5 10 FL 58 178 3.07
6 5 FL 85 240 2.82
7 4 FL 82 229 2.79
8 3 FL 85 231 2.72
9 2 FL 80 216 2.7
10 1 FL 89 239 2.69
# ... with 175 more rows
As you can see here the winner is VX
, which in May, June and August made a flight 15 times more often to the top 5 place than to other places.
Forgive me for not doing a performance test yet. Maybe soon. However, to all those who would like to make comparisons, please take into account two very important facts. First, I kept the dataframe in its original form. Secondly, I make the determination of the most popular directions, as it were, inside my calculations. Please include this in your possible performance tests.
Of course, I think I can be wrong somewhere. Maybe I misread the author of the question? English is not my mother tongue so I could have made a mistake reading the assumptions. However, I don't know where the bug is or why our results differ.
I improved my data.table
solution with the following one
thomas_data.table2 <- function() {
library(data.table)
dcast(
data.table(dest = my_flights_top_dest_across_months)[
unique(setDT(my_flights_raw)),
on = .(dest)
],
carrier + month ~ dest
)[
,
.(carrier, month, .SD[, my_flights_top_dest_across_months, with = FALSE] > 0)
]
}
and the benchmark is seen as below
Here is the benchmarking script:
library(nycflights13)
library(dplyr, warn.conflicts = FALSE)
# OP original
my_flights_raw <-
flights %>%
select(carrier, month, dest)
my_flights_agg <-
my_flights_raw %>%
count(month, dest, name = "n_obs") %>%
arrange(month, desc(n_obs))
my_flights_top_dest_across_months <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5) %>%
pull(dest) %>%
unique()
my_flights_top_5_by_month <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5)
my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)
op_slow <- function() {
library(tidyr)
library(tibble)
library(purrr)
my_flights_raw %>%
group_by(carrier, month) %>%
summarise(destinations_vec = list(unique(dest))) %>%
add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x), .keep = "unused") %>%
mutate(across(are_top_dest_included, ~ purrr::map(.x = ., .f = ~ setNames(object = .x, nm = my_flights_top_dest_across_month)))) %>%
tidyr::unnest_wider(are_top_dest_included)
}
# OP collapse
op_collapse <- function() {
library(magrittr)
library(collapse)
library(data.table)
my_flights_raw %>%
collapse::funique() %>%
collapse::fgroup_by(carrier, month) %>%
collapse::fsummarise(nested_dest = list(dest)) %>%
collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
collapse::qTBL()
}
# Thomas data.table
thomas_data.table1 <- function() {
library(data.table)
my_flights_top_dest_across_months <-
data.table(
dest = unique(my_flights_top_5_by_month$dest),
fd = 1
)
dcast(my_flights_top_dest_across_months[
setDT(my_flights_raw),
on = .(dest)
],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0
)[, c(
"carrier", "month",
my_flights_top_dest_across_months$dest
), with = FALSE]
}
thomas_data.table2 <- function() {
library(data.table)
dcast(
data.table(dest = my_flights_top_dest_across_months)[
unique(setDT(my_flights_raw)),
on = .(dest)
],
carrier + month ~ dest
)[
,
.(carrier, month, .SD[, my_flights_top_dest_across_months, with = FALSE] > 0)
]
}
# output_op_slow <- op_slow()
# output_op_collapse <- op_collapse()
# output_thomas1 <- thomas_data.table1()
# output_thomas2 <- thomas_data.table2()
# #> Using 'month' as value column. Use 'value.var' to override
# waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
# #> v No differences
# waldo::compare(output_op_slow, as_tibble(output_thomas1), ignore_attr = TRUE)
# #> v No differences
bm <- bench::mark(
op_slow = op_slow(),
op_collapse = op_collapse(),
thomas_dt1 = thomas_data.table1(),
thomas_dt2 = thomas_data.table2(),
check = FALSE,
iterations = 100L
)
ggplot2::autoplot(bm)
Given my_flights_top_5_by_month
and my_flights_raw
, we can try the following data.table
approach
library(data.table)
my_flights_top_dest_across_months <- data.table(
dest = unique(my_flights_top_5_by_month$dest),
fd = 1
)
dcast(my_flights_top_dest_across_months[
setDT(my_flights_raw),
on = .(dest)
],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0
)[, c(
"carrier", "month",
my_flights_top_dest_across_months$dest
), with = FALSE]
which gives
carrier month ATL ORD BOS MCO FLL LAX SFO CLT
1: 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
2: 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
3: 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
4: 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
5: 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
---
181: YV 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
182: YV 9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
183: YV 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
184: YV 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
185: YV 12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Does this do what you want? As far as I can tell it matches your output but has more rows because it includes all months for all carriers; carrier
"OO" only has flights in 5 months and your version only shows those 5 months in the summary.
With the data as provided (336k rows), this takes a similar amount of time as your function, but it's faster as you deal with larger data. When I run these on data 100x as big after setting my_flights_raw <- my_flights_raw %>% tidyr::uncount(100)
, to make it 33M rows, the code below is about 40% faster.
Given the large number of groups you're dealing with, I expect this is a situation where data.table
will really shine with better performance.
library(tidyverse)
my_flights_raw %>%
count(carrier, month, dest) %>%
complete(carrier, month, dest) %>%
filter(dest %in% my_flights_top_dest_across_months) %>%
mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>%
pivot_wider(names_from = dest, values_from = n)
I took a stub myself, using functions from the collapse
package.
library(magrittr)
library(collapse)
library(data.table)
my_flights_raw %>%
collapse::funique() %>%
collapse::fgroup_by(carrier, month) %>%
collapse::fsummarise(nested_dest = list(dest)) %>%
collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
collapse::qTBL()
Unsurprisingly, collapse
gives the fastest execution time. But I was surprised to see that @ThomasIsCoding's solution based on data.table
was slower than my original tidyverse
mix-and-match solution.
I also factored in the single data.table
dependency in Thomas's answer, compared to the variety of dependencies in my original method.
library(nycflights13)
library(dplyr, warn.conflicts = FALSE)
# OP original
my_flights_raw <-
flights %>%
select(carrier, month, dest)
my_flights_agg <-
my_flights_raw %>%
count(month, dest, name = "n_obs") %>%
arrange(month, desc(n_obs))
my_flights_top_dest_across_months <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5) %>%
pull(dest) %>%
unique()
my_flights_top_5_by_month <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5)
my_flights_top_dest_across_month <- unique(my_flights_top_5_by_month$dest)
op_slow <- function() {
library(tidyr)
library(tibble)
library(purrr)
my_flights_raw %>%
group_by(carrier, month) %>%
summarise(destinations_vec = list(unique(dest))) %>%
add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>%
tidyr::unnest_wider(are_top_dest_included)
}
# OP collapse
op_collapse <- function() {
library(magrittr)
library(collapse)
library(data.table)
my_flights_raw %>%
collapse::funique() %>%
collapse::fgroup_by(carrier, month) %>%
collapse::fsummarise(nested_dest = list(dest)) %>%
collapse::ftransform(new_col = lapply(nested_dest, \(x) my_flights_top_dest_across_months %in% x)) %>%
collapse::fcompute(., data.table::transpose(new_col), keep = 1:2) %>%
setNames(c("carrier", "month", my_flights_top_dest_across_months)) %>%
collapse::qTBL()
}
# Thomas data.table
thomas_data.table <- function() {
library(data.table)
my_flights_top_dest_across_months <-
data.table(
dest = unique(my_flights_top_5_by_month$dest),
fd = 1
)
dcast(my_flights_top_dest_across_months[
setDT(my_flights_raw),
on = .(dest)
],
carrier + month ~ dest,
fun.aggregate = function(x) sum(x) > 0
)[, c(
"carrier", "month",
my_flights_top_dest_across_months$dest
), with = FALSE]
}
output_op_slow <- op_slow()
output_op_collapse <- op_collapse()
output_thomas <- thomas_data.table()
#> Using 'month' as value column. Use 'value.var' to override
waldo::compare(output_op_slow, output_op_collapse, ignore_attr = TRUE)
#> v No differences
waldo::compare(output_op_slow, as_tibble(output_thomas), ignore_attr = TRUE)
#> v No differences
bm <- bench::mark(op_slow = op_slow(),
op_collapse = op_collapse(),
thomas_dt = thomas_data.table(),
check = FALSE,
iterations = 100)
ggplot2::autoplot(bm)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With