Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Total revenue generated from a route in R

Tags:

r

permutation

I have a dataset with origin ("from"), destination ("to") and price as below:

from    to  price
A       B   28109
A       D   2356
A       E   4216
B       A   445789
B       D   123
D       A   45674
D       B   1979

I want to sum the price considering the return route as well. for example, A - B consists of the following data:

from    to  price
  A     B   28109
  B     A   445789

Then, take the sum of the price (28109+445789). The output will be like this:

route   total_price
A - B   473898
A - D   48030
A - E   4216
B - D   2102

I was thinking to run a for loop but my data size is very large (800k rows). Any help will be highly appreciated. Thanks a lot in advance.

like image 241
user3642360 Avatar asked Mar 20 '19 17:03

user3642360


3 Answers

You can do this by sorting the from-to pairs, then grouping on that sorted pair and summing.

Edit: See @JasonAizkalns' answer for tidyverse equivalent

library(data.table)
setDT(df)

df[, .(total_price = sum(price))
   , by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]

#    route total_price
# 1: A - B      473898
# 2: A - D       48030
# 3: A - E        4216
# 4: B - D        2102

@Frank notes that this result hides the fact that route "A - E" is not complete, in the sense that there is no row of the original data with from == 'E' and to == 'A'. He's offered a good way of capturing that info (and more), and I've added some others below.

df[, .(total_price = sum(price), complete = .N > 1)
   , by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]

#    route total_price complete
# 1: A - B      473898     TRUE
# 2: A - D       48030     TRUE
# 3: A - E        4216    FALSE
# 4: B - D        2102     TRUE

df[, .(total_price = sum(price), paths_counted = .(paste(from, '-', to)))
   , by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]

#    route total_price paths_counted
# 1: A - B      473898   A - B,B - A
# 2: A - D       48030   A - D,D - A
# 3: A - E        4216         A - E
# 4: B - D        2102   B - D,D - B

Data used

df <- fread('
from    to  price
A       B   28109
A       D   2356
A       E   4216
B       A   445789
B       D   123
D       A   45674
D       B   1979')
like image 186
IceCreamToucan Avatar answered Sep 21 '22 02:09

IceCreamToucan


You could do a self-join and then things are pretty straightforward:

library(tidyverse)

df <- readr::read_table("
from    to  price
A       B   28109
A       D   2356
A       E   4216
B       A   445789
B       D   123
D       A   45674
D       B   1979
")

df %>%
  inner_join(df, by = c("from" = "to")) %>%
  filter(to == from.y) %>%
  mutate(
    route = paste(from, "-", to),
    total_price = price.x + price.y
  )
#> # A tibble: 6 x 7
#>   from  to    price.x from.y price.y route total_price
#>   <chr> <chr>   <dbl> <chr>    <dbl> <chr>       <dbl>
#> 1 A     B       28109 B       445789 A - B      473898
#> 2 A     D        2356 D        45674 A - D       48030
#> 3 B     A      445789 A        28109 B - A      473898
#> 4 B     D         123 D         1979 B - D        2102
#> 5 D     A       45674 A         2356 D - A       48030
#> 6 D     B        1979 B          123 D - B        2102

Created on 2019-03-20 by the reprex package (v0.2.1)

Because I like @IceCreamToucan's answer better, here's the tidyverse equivalent:

df %>%
  group_by(route = paste(pmin(from, to), "-", pmax(from, to))) %>%
  summarise(total_price = sum(price))
like image 24
JasonAizkalns Avatar answered Sep 19 '22 02:09

JasonAizkalns


Also one tidyverse possibility:

df %>%
 nest(from, to) %>%
 mutate(route = unlist(map(data, function(x) paste(sort(x), collapse = "_")))) %>%
 group_by(route) %>%
 summarise(total_price = sum(price))

  route total_price
  <chr>       <int>
1 A_B        473898
2 A_D         48030
3 A_E          4216
4 B_D          2102

In this case, it, first, creates a list composed of values "from" and "to" variables. Second, it sorts the elements in the list and combines them together, separated by _. Finally, it groups by the combined elements and gets the sum.

Or involving a wide-to-long transformation:

df %>%
 rowid_to_column() %>%
 gather(var, val, -c(rowid, price)) %>%
 arrange(rowid, val) %>%
 group_by(rowid) %>%
 summarise(route = paste(val, collapse = "_"),
           price = first(price)) %>%
 group_by(route) %>%
 summarise(total_price = sum(price))

For this, it, first, performs a wide-to-long data transformation, excluding the row ID and "price". Second, it arranges the data according row ID and values contained in "from" and "to". Third, it groups by row ID, combines the elements together, separated by _. Finally, it groups by this variable and gets the sum.

like image 29
tmfmnk Avatar answered Sep 19 '22 02:09

tmfmnk