I have an igraph object mygraph
with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs of different sizes from this graph.
What I need is to create subgraphs of a given size (from 5 nodes to 500 nodes) where all the nodes in each subgraph are connected. I need to create ~1,000 subgraphs for each size (i.e., 1,000 subgraphs for size 5, 1,000 for size 6, and so on), and then calculate some values for each subgraph according to different node attributes.
I have some code, but it takes a long time to do all the calculations. I thought about using the graphlets
function to get the different sizes, but every time I run it on my computer it crashes due to memory issues.
Here is the code I am using:
The first step was to create a function that builds the subgraphs of different sizes and does the calculations needed.
random_network <- function(size, G){
  # Score a node set from its weight and RWRNodeweight attributes
  score_fun <- function(g){
    subsum <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
    subsum
  }
  genes.idx <- V(G)$name
  perm <- c()
  while(length(perm) < 1000){
    # Grow a connected node set from a randomly chosen seed node
    seed <- sample(genes.idx, 1)
    while(length(seed) < size){
      tmp.neigh <- V(G)[unlist(neighborhood(G, 1, seed))]$name
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if(length(tmp.neigh) > 0)
        seed <- c(seed, sample(tmp.neigh, 1)) else break
    }
    if(length(seed) == size)
      perm <- c(perm, score_fun(induced.subgraph(G, seed)))
  }
  perm
}
The second step was to apply the function to the actual graph:
### generate some example data
library(igraph)
my_graph <- erdos.renyi.game(10000, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(10000)
V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)
### Run the code to get the subgraphs from different size and do calculations based on nodes
genesets.length <- seq(5, 500)
genesets.length.null.dis <- list()
for(k in 5:max(genesets.length)){
  genesets.length.null.dis[[as.character(k)]] <- random_network(size=k, G=my_graph)
}
Basically, your algorithm for sampling a subgraph can be described as follows: initialize the node set with a single randomly selected node, then iteratively add a neighbor of the current set until either there are no more neighbors or you reach the desired subgraph size.
The expensive repeated operation here is determining the neighbors of the current set, which you do with the following:
tmp.neigh <- V(G)[unlist(neighborhood(G,1,seed))]$name
tmp.neigh <- setdiff(tmp.neigh, seed)
In short, at each iteration you are looking at the neighbors of every node in your selected subset. A more efficient approach is to store a vector marking the neighbors of the current set and update it each time you add a new node, since then you only need to consider the neighbors of the newly added node.
josilber <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)
  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)
  perm <- replicate(num.rep, {
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node  # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      new.node <- sample(which(neigh & !used), 1)
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  })
  perm
}
For a single replicate, this yields a significant speedup over a single replicate of the code in the question.
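The exact timings will vary by machine, but a comparison along these lines can be reproduced with system.time. The sketch below is only illustrative and reuses the objects defined above (my_graph, random_network, josilber); since the question's random_network always draws 1,000 replicates, it is compared against josilber with num.rep = 1000 at the same size:
# Illustrative timing comparison (numbers will differ by machine);
# the question's random_network is hard-coded to 1,000 replicates,
# so compare against josilber with num.rep = 1000 at the same size.
system.time(random_network(size = 50, G = my_graph))  # original approach
system.time(josilber(50, 1000, my_graph))             # improved approach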
As mentioned in the other answers, parallelization could further improve the performance of doing 1000 replicates for a given graph size. The following uses the doParallel
package to parallelize the repeated step (the adjustment is pretty much identical to the one made by @Chris in his answer):
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

josilber2 <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)
  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)
  perm <- foreach (i=1:num.rep, .combine='c') %dopar% {
    library(igraph)
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node  # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      new.node <- sample(which(neigh & !used), 1)
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  }
  perm
}
On my MacBook Air, josilber(100, 1000, my_graph)
takes 670 seconds to run (the non-parallel version), while josilber2(100, 1000, my_graph)
takes 239 seconds to run (the parallel version configured with 4 workers). For the size=100
case we have therefore gotten a 20x speedup from the code improvements and an additional ~3x speedup (670/239 ≈ 2.8) from parallelization, for a total speedup of about 60x.
I don't have a complete answer but here are some things to consider to help speed it up (assuming there is not a much faster approach using a different method).
Remove from your graph any nodes which are not part of a component as large as the subgraphs you are looking for. It will really depend on your network structure, but since your nodes appear to be genes, there are likely many genes with very low degree, and you could get some speedup by removing them. Something like this code:
cgraph <- clusters(G)
tooSmall <- which(cgraph$csize < size)
toKeep <- setdiff(1:length(V(G)), which(cgraph$membership %in% tooSmall))
graph <- induced.subgraph(G, vids=toKeep)
Consider running this in parallel to take advantage of multiple cores, for example using the parallel package and mclapply.
library(parallel)
genesets.length <- seq(5, 500)
names(genesets.length) <- genesets.length
genesets.length.null.dis <- mclapply(genesets.length, mc.cores=7,
  function(length) {
    random_network(size=length, G=my_graph)
  })
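These two suggestions can also be combined. As a rough, illustrative sketch (random_network_filtered is a hypothetical helper, not from the original code), the component filtering shown earlier could be applied per size inside the function handed to mclapply, since which components are large enough depends on the subgraph size being sampled:
# Illustrative only: combine the component filtering with the parallel loop.
# random_network_filtered is a hypothetical wrapper around the question's
# random_network; it drops components smaller than the requested size first.
random_network_filtered <- function(size, G) {
  cgraph <- clusters(G)
  tooSmall <- which(cgraph$csize < size)
  toKeep <- setdiff(1:length(V(G)), which(cgraph$membership %in% tooSmall))
  random_network(size=size, G=induced.subgraph(G, vids=toKeep))
}

genesets.length.null.dis <- mclapply(genesets.length, mc.cores=7,
  function(length) {
    random_network_filtered(size=length, G=my_graph)
  })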