I have a large data set of passengers per route similar to the following:
routes <- c("MEX-GDL", "ACA-MEX", "CUN-MTY", "MTY-CUN", "GDL-MEX", "MEX-ACA")
pax <- sample(100:500, size = 6, replace = TRUE)
traffic <- data.frame(routes = routes, pax = pax)
   routes pax
1 MEX-GDL 282
2 ACA-MEX 428
3 CUN-MTY 350
4 MTY-CUN 412
5 GDL-MEX 474
6 MEX-ACA 263
I want to group flights whose origin and destination match in either direction, so as to get the total number of passengers on the route. For example, I could rename the route MEX-GDL as GDL-MEX (or vice versa) so I can then use group_by() on the data set.
Kind of like this:
traffic %>% group_by(routes) %>% summarise(sum(pax))
I have done the following and it works, but I believe there can be a more efficient way to solve the problem (as it takes quite some time to run):
library(tidyverse)
traffic$routes <- as.character(traffic$routes)
for (route in traffic$routes) {
  # Split "AAA-BBB" into its two airport codes
  a <- substring(route, first = 1, last = 3)
  b <- substring(route, first = 5, last = 7)
  # Find every row holding the reversed route "BBB-AAA" and rename it
  aux <- which(sapply(traffic$routes, str_detect, pattern = paste0(b, "-", a)))
  traffic$routes[aux] <- paste0(a, "-", b)
}
Any suggestions?
Thanks for the help!
Note: it's my first question here, so I hope I complied with all the guidelines.
We can use separate() to split routes into two columns, group by the pmin() and pmax() of those columns, and then get the sum of pax.
library(tidyverse)
traffic %>%
separate(routes, into = c("Col1", "Col2")) %>%
group_by(ColN = pmin(Col1, Col2), ColN2 = pmax(Col1, Col2)) %>%
summarise(Sum = sum(pax))
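To make the pmin()/pmax() step above more concrete, here is a minimal base R sketch. It uses the six pax values printed in the question rather than a fresh random draw, and the names `origin`, `dest`, `key` and `totals` are only illustrative, not part of the original answer. The idea is that taking the alphabetically smaller code first gives both directions of a route the same key.

```r
# Routes and passenger counts as printed in the question
routes <- c("MEX-GDL", "ACA-MEX", "CUN-MTY", "MTY-CUN", "GDL-MEX", "MEX-ACA")
pax <- c(282, 428, 350, 412, 474, 263)

# Split each route into its two airport codes
parts <- strsplit(routes, "-", fixed = TRUE)
origin <- vapply(parts, `[`, character(1), 1)
dest <- vapply(parts, `[`, character(1), 2)

# pmin()/pmax() compare element by element, so each key is
# "<smaller code>-<larger code>" regardless of flight direction
key <- paste(pmin(origin, dest), pmax(origin, dest), sep = "-")

# Total passengers per direction-independent route
totals <- tapply(pax, key, sum)
print(totals)
```

With these numbers, MEX-GDL and GDL-MEX both map to the key GDL-MEX, so their passengers are summed together.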
data.table version
data: (read ?I; I() stops data.frame() from converting routes to a factor)
traffic <- data.frame(routes = I(routes), pax = pax)
library(data.table)
setDT(traffic)[, routes := sapply(strsplit(routes, split = "-"),
                                  function(x) paste0(sort(x), collapse = "-"))
               ][, .(Sum = sum(pax)), by = routes]
result: (values differ because of the sample function)
#    routes Sum
#1: GDL-MEX 621
#2: ACA-MEX 595
#3: CUN-MTY 266
See ?sample, and use ?set.seed along with it so the example data is reproducible.
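As a small sketch of that advice: calling set.seed() before sample() makes the draw repeatable, so everyone running the example gets the same pax values. The seed value 42 is an arbitrary choice for illustration, not something from the original post.

```r
# Fix the random seed so sample() returns the same draw on every run.
# The seed value 42 is arbitrary (an assumption, not from the original post).
set.seed(42)
pax <- sample(100:500, size = 6, replace = TRUE)

# Re-seeding with the same value reproduces the same vector
set.seed(42)
pax_again <- sample(100:500, size = 6, replace = TRUE)
print(identical(pax, pax_again))
```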