
Set R data.table row order by chaining 2 columns

Tags: r, data.table

I'm trying to figure out how to order the rows of an R data.table by chaining 2 columns.

Here's my sample data.table.

dt <- data.table(id = c('A', 'A', 'A', 'A', 'A')
         , col1 = c(7521, 0, 7915, 5222, 5703)
         , col2 = c(7907, 5703, 8004, 7521, 5222))

   id col1 col2
1:  A 7521 7907
2:  A    0 5703
3:  A 7915 8004
4:  A 5222 7521
5:  A 5703 5222

I need the row order to start with the row where col1 = 0. From there, each row's col1 value should equal the col2 value of the preceding row, and so on.

There should generally be a matching value that chains each row to the next. If there isn't one, the row with the closest col1 value should come next (see rows 4 & 5 below).

The outcome I'm looking for is shown below:

   id col1 col2
1:  A    0 5703
2:  A 5703 5222
3:  A 5222 7521
4:  A 7521 7907
5:  A 7915 8004

I think I could write a crazy function to do this, but I'm wondering if there's an elegant data.table solution.

EDIT
I updated the table to include an additional ID with duplicate rows, and a unique source column:

dt <- data.table(id = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B')
               , col1 = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625)
               , col2 = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505)
               , source = c('c', 'b', 'a', 'e', 'd', 'y', 'z', 'x', 'w', 'v'))

    id col1 col2 source
 1:  A 7521 7907      c
 2:  A    0 5703      b
 3:  A 7915 8004      a
 4:  A 5222 7521      e
 5:  A 5703 5222      d
 6:  B 1644 1625      y
 7:  B 1625 1625      z
 8:  B    0 1644      x
 9:  B 1625 1625      w
10:  B 1625 1505      v

There can be duplicate col1/col2 values within an ID (see B, rows 7 & 9 above). However, each row has a unique source that identifies where the data comes from.

The desired output would be:

    id col1 col2 source
 1:  A    0 5703      b
 2:  A 5703 5222      d
 3:  A 5222 7521      e
 4:  A 7521 7907      c
 5:  A 7915 8004      a
 6:  B    0 1644      x
 7:  B 1644 1625      y
 8:  B 1625 1625      w
 9:  B 1625 1625      z
10:  B 1625 1505      v

In the output, the matching rows 8 & 9 could be in either order.

Thanks!

asked Apr 13 '20 14:04 by AlexP



1 Answer

Here is an option using igraph with data.table:

#prefix col1 and col2 with id so that equal values in different ids become distinct vertices
cols <- paste0("col", 1L:2L)
dt[, (cols) := lapply(.SD, function(x) paste0(id, x)), .SDcols=cols]

#all combinations of root nodes (never appear in col2) and leaf nodes (never appear in col1) per id
chains <- dt[, CJ(root=setdiff(col1, col2), leaf=setdiff(col2, col1)), id]

#find all paths from root nodes to leaf nodes
#note that igraph requires vertices to be of character type
library(igraph)
g <- graph_from_data_frame(dt[, .(col1, col2)])
l <- lapply(unlist(
  apply(chains, 1L, function(x) all_simple_paths(g, x[["root"]], x[["leaf"]])), 
  recursive=FALSE), names)
links <- data.table(g=rep(seq_along(l), lengths(l)), col1=unlist(l))

#look up edges
dt[links, on=.(col1), nomatch=0L]

output:

    id  col1  col2 source g
 1:  A    A0 A5703      b 1
 2:  A A5703 A5222      d 1
 3:  A A5222 A7521      e 1
 4:  A A7521 A7907      c 1
 5:  A A7915 A8004      a 2
 6:  B    B0 B1644      x 3
 7:  B B1644 B1625      y 3
 8:  B B1625 B1625      z 3
 9:  B B1625 B1625      w 3
10:  B B1625 B1505      v 3
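
To see where the three groups in g come from, it can help to look at the root and leaf values per id. The following is only a minimal sketch on the original numeric columns (before the id prefix is pasted on); the names dt_raw and ends are made up for illustration and are not part of the code above.

```r
library(data.table)

# Same sample data as in the question, kept numeric (no id prefix yet)
dt_raw <- data.table(id = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B')
  , col1 = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625)
  , col2 = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505))

# roots: values that start a chain (in col1 but never in col2)
# leaves: values that end a chain (in col2 but never in col1)
ends <- dt_raw[, .(roots  = list(setdiff(col1, col2)),
                   leaves = list(setdiff(col2, col1))), by = id]

ends[id == "A", roots[[1L]]]   # 0 7915
ends[id == "A", leaves[[1L]]]  # 7907 8004
ends[id == "B", roots[[1L]]]   # 0
ends[id == "B", leaves[[1L]]]  # 1505
```

Group A has two roots and two leaves, so the cross join yields four root/leaf pairs, of which only 0 to 7907 and 7915 to 8004 are actually connected. Group B has a single root and a single leaf. That is consistent with the three values of g in the output.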

data:

library(data.table)
dt <- data.table(id = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B')
  , col1 = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625)
  , col2 = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505)
  , source = c('c', 'b', 'a', 'e', 'd', 'y', 'z', 'x', 'w', 'v'))
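
The question also asks for the closest value when there is no exact match. As an alternative that does not use igraph, a plain greedy loop can be sketched as below. This is not the approach above, just an illustration under some assumptions: chain_order is a hypothetical helper, each id starts at the row whose col1 is 0 (or nearest to 0), and ties prefer rows where col1 equals col2 so that the chain does not move on too early.

```r
library(data.table)

# Hypothetical helper: returns the row order for one id group.
# Starting from target 0, repeatedly pick the unused row whose col1 is
# closest to the target, then set the target to that row's col2.
chain_order <- function(col1, col2) {
  remaining <- seq_along(col1)
  ord <- integer(0)
  target <- 0
  while (length(remaining) > 0L) {
    # sort candidates by distance to the target (exact match = 0);
    # among ties, rows with col1 == col2 come first
    best <- order(abs(col1[remaining] - target),
                  col2[remaining] != col1[remaining])[1L]
    pick <- remaining[best]
    ord <- c(ord, pick)
    remaining <- setdiff(remaining, pick)
    target <- col2[pick]
  }
  ord
}

# Numeric sample data from the question (separate name, so dt is untouched)
dt_raw <- data.table(id = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B')
  , col1 = c(7521, 0, 7915, 5222, 5703, 1644, 1625, 0, 1625, 1625)
  , col2 = c(7907, 5703, 8004, 7521, 5222, 1625, 1625, 1644, 1625, 1505)
  , source = c('c', 'b', 'a', 'e', 'd', 'y', 'z', 'x', 'w', 'v'))

# reorder the rows within each id
res <- dt_raw[, .SD[chain_order(col1, col2)], by = id]
res$source  # "b" "d" "e" "c" "a" "x" "y" "z" "w" "v"
```

The loop is quadratic in the number of rows per id, so it is probably only reasonable for small groups. It also follows a single chain greedily and does not backtrack, so it may give a poor order when a group contains branching paths.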
answered Oct 02 '22 21:10 by chinsoon12