Assume the following set of original transactions:
library(tidyverse)

original_transactions <- data.frame(
  row = 1:6,
  start = 0,
  change = runif(6, min = -10, max = 10) %>% round(2),
  end = 0
) %>% mutate(
  temp = cumsum(change),
  end = 100 + temp,      # End balance
  start = end - change   # Start balance
) %>% select(
  -temp
)
This gives a (chronological) sequence of six transactions with a starting balance of $100.00 and, in my run, an ending balance of $95.65 (no seed is set, so the exact values differ between runs).
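By construction, each transaction's end balance equals the following transaction's start balance; just to illustrate the property being exploited, a quick check could be:

all.equal(head(original_transactions$end, -1),   # end balances of rows 1..5
          tail(original_transactions$start, -1)) # start balances of rows 2..6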
Now assume that you receive a jumbled version of this:

transactions <- original_transactions %>%
  sample_n(6) %>%
  mutate(
    row = row_number() # Original sequence is unknown
  )
How can I reverse-engineer the sequence in R? That is, how can I get the sort order of transactions to match that of original_transactions? Ideally I'd like to do this using dplyr and a sequence of pipes %>%, and to avoid loops.
Assume that the start/end balances will be unique and that, in general, the number of transactions can vary.
First, here are the two data frames from one particular run:
original_transactions
# row start change end
# 1 1 100.00 2.33 102.33
# 2 2 102.33 -6.52 95.81
# 3 3 95.81 -4.20 91.61
# 4 4 91.61 -3.56 88.05
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
transactions
# row start change end
# 1 1 100.00 2.33 102.33
# 2 2 91.61 -3.56 88.05
# 3 3 95.81 -4.20 91.61
# 4 4 102.33 -6.52 95.81
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
Now compute:
diffs <- outer(transactions$start, transactions$start, `-`)        # diffs[i, j] = start[i] - start[j]
matches <- abs(sweep(diffs, 2, transactions$change, `-`)) < 1e-3   # TRUE when start[i] is (nearly) end[j]
I guess that computing diffs is the most computationally expensive part of the whole solution. diffs holds all possible differences between the start values of your transactions. Comparing those differences with the change column then gives matches, which tells us which pairs of rows of transactions belong together: matches[i, j] is TRUE exactly when row i should come right after row j. If there were no problems with numeric precision, we could use the match function and be done quickly (see the sketch below). In this case, however, we have the following two options.
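Just for illustration, the match shortcut mentioned above could look roughly like the sketch below; it assumes the balances match exactly and therefore breaks down with rounding error, which is why matches uses a tolerance instead.

# Sketch only: assumes end balances equal the next start balances exactly
nxt   <- match(transactions$end, transactions$start)       # successor row of each row (NA for the last)
first <- which(!transactions$start %in% transactions$end)  # the row with no predecessor
path  <- Reduce(function(i, k) nxt[i],                     # k is a dummy argument; we just walk the chain
                seq_len(nrow(transactions) - 1), first, accumulate = TRUE)
transactions[unlist(path), ]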
First, we may use igraph.
library(igraph)
# Directed graph with an edge j -> i whenever row i follows row j
(g <- graph_from_adjacency_matrix(t(matches) * 1))
# IGRAPH 45d33f0 D--- 6 5 --
# + edges from 45d33f0:
# [1] 1->4 2->5 3->2 4->3 5->6
That is, we have a hidden path graph, 1->4->3->2->5->6, which we want to recover. It is given by the longest path from the vertex that has no incoming edges (here, vertex 1):
# Take the longest simple path starting from the vertex with no incoming edge
transactions[as.vector(tail(all_simple_paths(g, from = which(rowSums(matches) == 0)), 1)[[1]]), ]
# row start change end
# 1 1 100.00 2.33 102.33
# 4 4 102.33 -6.52 95.81
# 3 3 95.81 -4.20 91.61
# 2 2 91.61 -3.56 88.05
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
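(As an aside, not part of the original answer: since g is a single directed path and hence a DAG, a topological sort should recover the same order.)

# Alternative sketch: the topological order of a path graph is the path itself
transactions[as.vector(topo_sort(g)), ]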
Another option is recursive.
fun <- function(x, path = x) {
  # Column x of `matches` marks the row that comes right after row x
  if (length(xNew <- which(matches[, x])) > 0)
    fun(xNew, c(path, xNew))
  else path
}
transactions[fun(which(rowSums(matches) == 0)), ]
# row start change end
# 1 1 100.00 2.33 102.33
# 4 4 102.33 -6.52 95.81
# 3 3 95.81 -4.20 91.61
# 2 2 91.61 -3.56 88.05
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
It relies on the same idea of a unique longest path as the igraph approach.
No explicit loops... And of course you may rewrite everything with %>%, but it won't be as pretty as you'd like; this is not really a traditional data transformation task where dplyr is at its best.
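For instance, a minimal sketch of bolting the recursive variant onto a pipe (reusing the matches and fun objects defined above) could be:

transactions %>%
  slice(fun(which(rowSums(matches) == 0)))

which returns the same rows in the same order as above.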