
Removing reverse duplicate rows

I have a data.table with two columns of genes, where each row is treated as a pair. Some gene pairs are duplicated with the order reversed. I am looking for a faster method, preferably without a loop like the one below, that keeps only the unique pairs in my table.

library(data.table)
genes <- data.table(geneA = LETTERS[1:10], geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))

revG <- genes[,.(geneA = geneB, geneB = geneA)]
d <- fintersect(genes, revG)

for (x in 1:nrow(d)) {
  entry <- d[,c(geneA[x], geneB[x])]; revEntry <- rev(entry)
  dupEntry <- d[geneA %chin% revEntry[1] & geneB %chin% revEntry[2]]
  if (nrow(dupEntry) > 0) {
    d <- d[!(geneA %chin% dupEntry[,geneA] & geneB %chin% dupEntry[,geneB])]
  }
}

The table d contains the pairs that also appear in reversed order. After the loop, one copy of each remains. I then take a subset of the original genes table, excluding the copies in d, and store the row index. I have a list whose names are the same as the first column in genes. The index is used to filter that list according to the duplicate pairs that were removed by the loop.

idx <- genes[!(geneA %chin% d[,geneA] & geneB %chin% d[,geneB]), which = TRUE]

geneList <- vector("list", length = nrow(genes)); names(geneList) <- genes[,geneA]
geneList <- geneList[idx]

The above method isn't necessarily too slow, but I plan on using ~12K genes, so the slowdown might become noticeable then. I found a question about the same problem, but it does not use data.table. It uses an apply function to get the job done, which might also be slow on larger inputs.

abbas786 asked Nov 08 '22 00:11


1 Answer

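I believe what you are asking is similar to this: given a list of ordered pairs (permutations of 2), how can I get the unordered pairs (combinations)? One option is igraph: treat each pair as an edge of an undirected graph and collapse the repeated edges.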

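library(data.table)
library(igraph)
genes <- data.table(geneA = LETTERS[1:10], geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))
# Treat each pair as an undirected edge, so A-B and B-A are the same edge
g <- graph_from_data_frame(genes, directed = FALSE)
# Collapse repeated edges; remove.loops = TRUE also drops self-pairs such as A-A
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
get.data.frame(g)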
  from to
1    A  C
2    A  J
3    B  C
4    B  G
5    D  E
6    F  I
7    G  H
8    H  J
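
The igraph result is an edge list, while the question also needs a row index into genes to filter geneList. A minimal sketch of a data.table-only alternative (a different technique, not the igraph method above): sort each pair with pmin/pmax so both orientations share one key, then keep the first occurrence of every key. The names pairKey, lo, hi, and uniquePairs are illustrative and do not come from the question.

```r
library(data.table)

genes <- data.table(
  geneA = LETTERS[1:10],
  geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A")
)

# Order each pair alphabetically so that (E, D) and (D, E) get the same key.
# pairKey, lo and hi are illustrative names, not part of the original data.
pairKey <- genes[, .(lo = pmin(geneA, geneB), hi = pmax(geneA, geneB))]

# Row indices of the first occurrence of every unordered pair.
idx <- which(!duplicated(pairKey))

# Unique pairs, in their original orientation and row order.
uniquePairs <- genes[idx]

# The same index filters the list from the question.
geneList <- vector("list", length = nrow(genes))
names(geneList) <- genes[, geneA]
geneList <- geneList[idx]
```

Unlike simplify() with remove.loops = TRUE, this sketch keeps self-pairs such as A-A, and it always keeps the first copy of a reversed pair rather than the second.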

# benchmark
library(microbenchmark)
set.seed(1283782)
fn1 <- function(genes) {
  g <- graph_from_data_frame(genes, directed = FALSE)
  g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
  get.data.frame(g)
}
genes <- data.table(geneA = sample(LETTERS, 20000, TRUE), geneB = sample(LETTERS, 20000, TRUE))
microbenchmark(fn1(genes), times = 1)
       expr      min       lq     mean   median       uq      max neval
 fn1(genes) 8.605717 8.605717 8.605717 8.605717 8.605717 8.605717     1
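
One caveat about the benchmark data: both columns are drawn from only 26 letters, so at most choose(26, 2) = 325 distinct unordered pairs can remain once self-pairs are dropped. A rough sketch of a test set closer to the ~12K genes case follows. The gene IDs are hypothetical placeholders, and the 1000 appended reversed rows are an arbitrary assumption made only so that reversed duplicates exist. Timings depend on the machine, so none are shown.

```r
library(data.table)
library(igraph)

set.seed(1283782)
n <- 12000
# Hypothetical gene identifiers; real gene names would replace these.
ids <- sprintf("gene%05d", seq_len(n))
genes <- data.table(geneA = sample(ids, n, TRUE), geneB = sample(ids, n, TRUE))
# Append the reverse of the first 1000 rows so reversed duplicates exist.
genes <- rbind(genes, genes[1:1000, .(geneA = geneB, geneB = geneA)])

# Same approach as fn1 above: undirected graph, then collapse repeated edges.
fn1 <- function(genes) {
  g <- graph_from_data_frame(genes, directed = FALSE)
  g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
  get.data.frame(g)
}

# Base R timing of a single run; the numbers will vary between machines.
timing <- system.time(res <- fn1(genes))
print(timing)
```
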
Mario GS answered Nov 15 '22 08:11