
Set R data.table row order by chaining 2 columns

Tags: r, data.table

I'm trying to figure out how to order an R data table based on the chaining of 2 columns.

Here's my sample data.table.

dt <- data.table(id = c('A', 'A', 'A', 'A', 'A')
         , col1 = c(7521, 0, 7915, 5222, 5703)
         , col2 = c(7907, 5703, 8004, 7521, 5222))

   id col1 col2
1:  A 7521 7907
2:  A    0 5703
3:  A 7915 8004
4:  A 5222 7521
5:  A 5703 5222

I need the row order to start with col1 = 0. The col1 value in row 2 should be equal to the value of col2 in the preceding row, and so on.

There should generally be a matching value that chains one row to the next. If there isn't, the next row should be the one with the closest col1 value (see rows 4 & 5 below).

The outcome I'm looking for is shown below:

   id col1 col2
1:  A    0 5703
2:  A 5703 5222
3:  A 5222 7521
4:  A 7521 7907
5:  A 7915 8004

I think I can write a crazy function to do this, but I'm wondering if there's an elegant data.table solution.

EDIT
I updated the table to include an additional ID with duplicate rows, and a unique source column:

dt <- data.table(id = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B')
               , col1 = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625)
               , col2 = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505)
               , source = c('c', 'b', 'a', 'e', 'd', 'y', 'z', 'x', 'w', 'v'))

    id col1 col2 source
 1:  A 7521 7907      c
 2:  A    0 5703      b
 3:  A 7915 8004      a
 4:  A 5222 7521      e
 5:  A 5703 5222      d
 6:  B 1644 1625      y
 7:  B 1625 1625      z
 8:  B    0 1644      x
 9:  B 1625 1625      w
10:  B 1625 1505      v

There can be rows with identical col1 and col2 values within an ID (see B, rows 7 & 9 above). However, each row has a unique source that identifies where the data comes from.

The desired output would be:

    id col1 col2 source
 1:  A    0 5703      b
 2:  A 5703 5222      d
 3:  A 5222 7521      e
 4:  A 7521 7907      c
 5:  A 7915 8004      a
 6:  B    0 1644      x
 7:  B 1644 1625      y
 8:  B 1625 1625      w
 9:  B 1625 1625      z
10:  B 1625 1505      v

In the output, the matching rows (8 & 9) could be in any order.

Thanks!


asked Apr 13 '20 14:04

AlexP

1 Answer

Here is an option using igraph with data.table:

#add id in front of cols to distinguish them as vertices
cols <- paste0("col", 1L:2L)
dt[, (cols) := lapply(.SD, function(x) paste0(id, x)), .SDcols=cols]

#all combinations of root nodes and leaf nodes within each id
chains <- dt[, CJ(root=setdiff(col1, col2), leaf=setdiff(col2, col1)), id]

#find all paths from root nodes to leaf nodes
#note that igraph requires vertices to be of character type
library(igraph)
g <- graph_from_data_frame(dt[, .(col1, col2)])
l <- lapply(unlist(
  apply(chains, 1L, function(x) all_simple_paths(g, x[["root"]], x[["leaf"]])), 
  recursive=FALSE), names)
links <- data.table(g=rep(seq_along(l), lengths(l)), col1=unlist(l))

#look up edges
dt[links, on=.(col1), nomatch=0L]

output:

    id  col1  col2 source g
 1:  A    A0 A5703      b 1
 2:  A A5703 A5222      d 1
 3:  A A5222 A7521      e 1
 4:  A A7521 A7907      c 1
 5:  A A7915 A8004      a 2
 6:  B    B0 B1644      x 3
 7:  B B1644 B1625      y 3
 8:  B B1625 B1625      z 3
 9:  B B1625 B1625      w 3
10:  B B1625 B1505      v 3
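
One way to sanity-check a result like the one above is to confirm that, within each chain g, every col1 equals the col2 of the row before it. The sketch below is only an illustration: it types the output in by hand rather than re-running the igraph code, and the column names prev_col2 and linked are made up for the check.

```r
library(data.table)

# the answer's output, typed in by hand (only the columns the check needs)
out <- data.table(
  col1 = c("A0", "A5703", "A5222", "A7521", "A7915",
           "B0", "B1644", "B1625", "B1625", "B1625"),
  col2 = c("A5703", "A5222", "A7521", "A7907", "A8004",
           "B1644", "B1625", "B1625", "B1625", "B1505"),
  g    = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 3)
)

# prev_col2 (hypothetical name): col2 of the preceding row within the same chain,
# NA for the first row of each chain
out[, prev_col2 := shift(col2), by = g]

# linked (hypothetical name): TRUE when the row starts a chain or continues the previous row
out[, linked := is.na(prev_col2) | col1 == prev_col2]

print(out)
```

Every row should come out as linked here, because the row that does not chain (A7915) sits in its own group g = 2 instead of being appended to the first chain.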

data:

library(data.table)
dt <- data.table(id = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B')
  , col1 = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625)
  , col2 = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505)
  , source = c('c', 'b', 'a', 'e', 'd', 'y', 'z', 'x', 'w', 'v'))
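
If you would rather keep the columns numeric and also get the "closest value" fallback from the question, a plain greedy loop per id might be enough for small groups. This is not the igraph approach above, just a minimal sketch under a few assumptions: chain_order() is a hypothetical helper (not part of data.table), every chain starts at col1 == 0, and among equally close candidates the rows with col1 == col2 are taken first so the row that moves the chain forward comes last.

```r
library(data.table)

dt <- data.table(
  id     = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'),
  col1   = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625),
  col2   = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505),
  source = c('c', 'b', 'a', 'e', 'd', 'y', 'z', 'x', 'w', 'v')
)

# chain_order() is a hypothetical helper: it returns the row positions of one
# id in chained order, assuming the chain starts at col1 == 0
chain_order <- function(col1, col2) {
  remaining <- seq_along(col1)
  ord <- integer(0)
  target <- 0
  while (length(remaining) > 0L) {
    # distance of each unused row's col1 from the value we want to continue from
    d <- abs(col1[remaining] - target)
    cand <- remaining[d == min(d)]
    # tie-break: rows with col1 == col2 first, then original row order
    pick <- cand[order(col2[cand] != col1[cand])][1L]
    ord <- c(ord, pick)
    remaining <- setdiff(remaining, pick)
    target <- col2[pick]
  }
  ord
}

# reorder the rows of each id independently
res <- dt[, .SD[chain_order(col1, col2)], by = id]
print(res)
```

The loop rescans the unused rows at every step, so it is quadratic in the group size. That is probably fine for a handful of rows per id, but for long chains the graph-based approach is likely the better fit.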


answered Oct 02 '22 21:10

chinsoon12
