I have a data.table with two columns of genes, where each row is treated as a pair. Some gene pairs are duplicated with the order reversed. I am looking for a faster way to keep only the unique pairs in my table, preferably without a loop like the one below.
library(data.table)
genes <- data.table(geneA = LETTERS[1:10], geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))
revG <- genes[,.(geneA = geneB, geneB = geneA)]
d <- fintersect(genes, revG)
for (x in 1:nrow(d)) {
  entry <- d[, c(geneA[x], geneB[x])]
  revEntry <- rev(entry)
  dupEntry <- d[geneA %chin% revEntry[1] & geneB %chin% revEntry[2]]
  if (nrow(dupEntry) > 0) {
    d <- d[!(geneA %chin% dupEntry[, geneA] & geneB %chin% dupEntry[, geneB])]
  }
}
The table d initially contains the duplicated, reversed pairs. After the loop, one copy of each pair remains. I then subset the original genes table, excluding the copies left in d, and store the row index. I also have a list whose names are the same as the first column of genes. The index is used to filter that list according to the duplicate pairs removed by the loop.
idx <- genes[!(geneA %chin% d[,geneA] & geneB %chin% d[,geneB]), which = TRUE]
geneList <- vector("list", length = nrow(genes)); names(geneList) <- genes[,geneA]
geneList <- geneList[idx]
The above method isn't necessarily too slow, but I plan on using ~12K genes, so the run time might become noticeable. I found a question about the same problem, but it does not use data.table. It uses an apply function to get the job done, which might also be slow on larger inputs.
I believe what you are asking is similar to this: given a list of permutations of size 2, how can I get the combinations?
One option is to use igraph, treating each pair as an undirected edge.
library(data.table)
library(igraph)
genes <- data.table(geneA = LETTERS[1:10], geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))
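# Build an undirected graph so that (A, B) and (B, A) are the same edge
g <- graph_from_data_frame(genes, directed = FALSE)
# Collapse repeated edges (and drop self-loops, if any)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
get.data.frame(g)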
  from to
1    A  C
2    A  J
3    B  C
4    B  G
5    D  E
6    F  I
7    G  H
8    H  J
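If you would rather stay within data.table, the same "permutations to combinations" idea can be sketched by sorting each pair before checking for duplicates. This is only a minimal sketch, not the igraph method above: the names key and uniquePairs are made up for illustration, and it assumes both columns are character vectors. Unlike simplify() with remove.loops = TRUE, it does not drop self-pairs such as (A, A).

```r
library(data.table)

genes <- data.table(
  geneA = LETTERS[1:10],
  geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A")
)

# Order each pair alphabetically so that (D, E) and (E, D) share one key.
# 'key' is a hypothetical helper table, not part of the original code.
key <- genes[, .(lo = pmin(geneA, geneB), hi = pmax(geneA, geneB))]

# Keep only the first occurrence of each unordered pair,
# leaving the original column order of the kept rows untouched.
uniquePairs <- genes[!duplicated(key)]
print(uniquePairs)
```

Because pmin(), pmax() and duplicated() are vectorised, this avoids the row-by-row loop; how it compares with igraph on your real data is something you would need to benchmark yourself.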
# benchmark
library(microbenchmark)
set.seed(1283782)
fn1 <- function(genes) {
  g <- graph_from_data_frame(genes, directed = FALSE)
  g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
  get.data.frame(g)
}
genes <- data.table(geneA = sample(LETTERS, 20000, TRUE), geneB = sample(LETTERS, 20000, TRUE))
microbenchmark(fn1(genes), times = 1)
expr min lq mean median uq max neval
fn1(genes) 8.605717 8.605717 8.605717 8.605717 8.605717 8.605717 1
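The question also needs a row index to filter the named list. A rough sketch of one way to get it without the loop is shown below. It is an assumption-laden illustration rather than a drop-in replacement: keepIdx is a hypothetical name, and this version keeps the first occurrence of each pair, which may differ from the copy the original loop retains.

```r
library(data.table)

genes <- data.table(
  geneA = LETTERS[1:10],
  geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A")
)

# Named list as in the question: one (empty) element per row of genes.
geneList <- vector("list", length = nrow(genes))
names(geneList) <- genes[, geneA]

# Row numbers of the first occurrence of each unordered pair.
# 'keepIdx' is a hypothetical name used only for this sketch.
keepIdx <- which(!duplicated(genes[, .(pmin(geneA, geneB), pmax(geneA, geneB))]))

# Filter the list with the same index.
geneList <- geneList[keepIdx]
names(geneList)
```

Working with row positions rather than gene names keeps the list aligned with the table even when a gene appears in more than one pair.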