I have a dataset with origin ("from"), destination ("to") and price as below:
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979
I want to sum the price considering the return route as well. for example, A - B consists of the following data:
from to price
A B 28109
B A 445789
Then, take the sum of the price (28109+445789). The output will be like this:
route total_price
A - B 473898
A - D 48030
A - E 4216
B - D 2102
I was thinking to run a for loop but my data size is very large (800k rows). Any help will be highly appreciated. Thanks a lot in advance.
You can do this by sorting the from-to pairs, then grouping on that sorted pair and summing.
Edit: See @JasonAizkalns' answer for tidyverse equivalent
library(data.table)
setDT(df)
df[, .(total_price = sum(price))
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price
# 1: A - B 473898
# 2: A - D 48030
# 3: A - E 4216
# 4: B - D 2102
@Frank notes that this result hides the fact that route "A - E"
is not complete, in the sense that there is no row of the original data with from == 'E'
and to == 'A'
. He's offered a good way of capturing that info (and more), and I've added some others below.
df[, .(total_price = sum(price), complete = .N > 1)
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price complete
# 1: A - B 473898 TRUE
# 2: A - D 48030 TRUE
# 3: A - E 4216 FALSE
# 4: B - D 2102 TRUE
df[, .(total_price = sum(price), paths_counted = .(paste(from, '-', to)))
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price paths_counted
# 1: A - B 473898 A - B,B - A
# 2: A - D 48030 A - D,D - A
# 3: A - E 4216 A - E
# 4: B - D 2102 B - D,D - B
Data used
df <- fread('
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979')
You could do a self-join and then things are pretty straightforward:
library(tidyverse)
df <- readr::read_table("
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979
")
df %>%
inner_join(df, by = c("from" = "to")) %>%
filter(to == from.y) %>%
mutate(
route = paste(from, "-", to),
total_price = price.x + price.y
)
#> # A tibble: 6 x 7
#> from to price.x from.y price.y route total_price
#> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 A B 28109 B 445789 A - B 473898
#> 2 A D 2356 D 45674 A - D 48030
#> 3 B A 445789 A 28109 B - A 473898
#> 4 B D 123 D 1979 B - D 2102
#> 5 D A 45674 A 2356 D - A 48030
#> 6 D B 1979 B 123 D - B 2102
Created on 2019-03-20 by the reprex package (v0.2.1)
Because I like @IceCreamToucan's answer better, here's the tidyverse
equivalent:
df %>%
group_by(route = paste(pmin(from, to), "-", pmax(from, to))) %>%
summarise(total_price = sum(price))
Also one tidyverse
possibility:
df %>%
nest(from, to) %>%
mutate(route = unlist(map(data, function(x) paste(sort(x), collapse = "_")))) %>%
group_by(route) %>%
summarise(total_price = sum(price))
route total_price
<chr> <int>
1 A_B 473898
2 A_D 48030
3 A_E 4216
4 B_D 2102
In this case, it, first, creates a list composed of values "from" and "to" variables. Second, it sorts the elements in the list and combines them together, separated by _
. Finally, it groups by the combined elements and gets the sum.
Or involving a wide-to-long transformation:
df %>%
rowid_to_column() %>%
gather(var, val, -c(rowid, price)) %>%
arrange(rowid, val) %>%
group_by(rowid) %>%
summarise(route = paste(val, collapse = "_"),
price = first(price)) %>%
group_by(route) %>%
summarise(total_price = sum(price))
For this, it, first, performs a wide-to-long data transformation, excluding the row ID and "price". Second, it arranges the data according row ID and values contained in "from" and "to". Third, it groups by row ID, combines the elements together, separated by _
. Finally, it groups by this variable and gets the sum.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With