Assume the following set of original transactions:
library(tidyverse)

original_transactions <- data.frame(
  row = 1:6,
  start = 0,
  change = runif(6, min = -10, max = 10) %>% round(2),
  end = 0
) %>% mutate(
  temp = cumsum(change),
  end = 100 + temp,      # End balance
  start = end - change   # Start balance
) %>% select(
  -temp
)
This gives a (chronological) sequence of six transactions with a starting balance of $100.00 and, in my run, an ending balance of $95.65 (no seed is set, so the exact values differ between runs).
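By construction, each transaction's end balance equals the following transaction's start balance; just to illustrate the property being exploited, a quick check could be:

all.equal(head(original_transactions$end, -1),   # end balances of rows 1..5
          tail(original_transactions$start, -1)) # start balances of rows 2..6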
Now assume that you receive a jumbled version of this:

transactions <- original_transactions %>%
  sample_n(6) %>%
  mutate(
    row = row_number() # Original sequence is unknown
  )
How can I reverse-engineer the sequence in R? That is, how can I get the sort order of transactions to match that of original_transactions? Ideally I'd like to do this using dplyr and a sequence of pipes %>%, and to avoid loops.
Assume that the start/end balances will be unique and that, in general, the number of transactions can vary.
First, here are the two data frames from one particular run:
original_transactions
# row start change end
# 1 1 100.00 2.33 102.33
# 2 2 102.33 -6.52 95.81
# 3 3 95.81 -4.20 91.61
# 4 4 91.61 -3.56 88.05
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
transactions
# row start change end
# 1 1 100.00 2.33 102.33
# 2 2 91.61 -3.56 88.05
# 3 3 95.81 -4.20 91.61
# 4 4 102.33 -6.52 95.81
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
Now compute:
diffs <- outer(transactions$start, transactions$start, `-`)        # diffs[i, j] = start[i] - start[j]
matches <- abs(sweep(diffs, 2, transactions$change, `-`)) < 1e-3   # TRUE when start[i] is (nearly) end[j]
I guess that computing diffs is the most computationally expensive part of the whole solution. diffs holds all possible differences between the start values of your transactions. Comparing those differences with the change column then gives matches, which tells us which pairs of rows of transactions belong together: matches[i, j] is TRUE exactly when row i should come right after row j. If there were no problems with numeric precision, we could use the match function and be done quickly (see the sketch below). In this case, however, we have the following two options.
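Just for illustration, the match shortcut mentioned above could look roughly like the sketch below; it assumes the balances match exactly and therefore breaks down with rounding error, which is why matches uses a tolerance instead.

# Sketch only: assumes end balances equal the next start balances exactly
nxt   <- match(transactions$end, transactions$start)       # successor row of each row (NA for the last)
first <- which(!transactions$start %in% transactions$end)  # the row with no predecessor
path  <- Reduce(function(i, k) nxt[i],                     # k is a dummy argument; we just walk the chain
                seq_len(nrow(transactions) - 1), first, accumulate = TRUE)
transactions[unlist(path), ]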
First, we may use igraph.
library(igraph)
# Directed graph with an edge j -> i whenever row i follows row j
(g <- graph_from_adjacency_matrix(t(matches) * 1))
# IGRAPH 45d33f0 D--- 6 5 --
# + edges from 45d33f0:
# [1] 1->4 2->5 3->2 4->3 5->6
That is, we have a hidden path graph, 1->4->3->2->5->6, which we want to recover. It is given by the longest path from the vertex that has no incoming edges (here, vertex 1):
# Take the longest simple path starting from the vertex with no incoming edge
transactions[as.vector(tail(all_simple_paths(g, from = which(rowSums(matches) == 0)), 1)[[1]]), ]
# row start change end
# 1 1 100.00 2.33 102.33
# 4 4 102.33 -6.52 95.81
# 3 3 95.81 -4.20 91.61
# 2 2 91.61 -3.56 88.05
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
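(As an aside, not part of the original answer: since g is a single directed path and hence a DAG, a topological sort should recover the same order.)

# Alternative sketch: the topological order of a path graph is the path itself
transactions[as.vector(topo_sort(g)), ]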
Another option is recursive.
fun <- function(x, path = x) {
  # Column x of `matches` marks the row that comes right after row x
  if (length(xNew <- which(matches[, x])) > 0)
    fun(xNew, c(path, xNew))
  else path
}
transactions[fun(which(rowSums(matches) == 0)), ]
# row start change end
# 1 1 100.00 2.33 102.33
# 4 4 102.33 -6.52 95.81
# 3 3 95.81 -4.20 91.61
# 2 2 91.61 -3.56 88.05
# 5 5 88.05 7.92 95.97
# 6 6 95.97 3.61 99.58
It relies on the same idea of a unique longest path as the igraph approach.
No explicit loops... And of course you may rewrite everything with %>%, but it won't be as pretty as you'd like; this is not really a traditional data transformation task where dplyr is at its best.
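For instance, a minimal sketch of bolting the recursive variant onto a pipe (reusing the matches and fun objects defined above) could be:

transactions %>%
  slice(fun(which(rowSums(matches) == 0)))

which returns the same rows in the same order as above.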