Sampling subgraphs of different sizes using igraph

I have an igraph object mygraph with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs of different sizes from this graph. Specifically, I need to create subgraphs of a given size (from 5 nodes to 500 nodes) in which all the nodes of each subgraph are connected. I need to create ~1,000 subgraphs for each size (i.e., 1,000 subgraphs of size 5, 1,000 of size 6, and so on), and then calculate some values for each subgraph according to different node attributes. I have some code, but it takes a long time to do all the calculations. I thought about using the graphlets function to get the different sizes, but every time I run it on my computer it crashes due to memory issues.

Here is the code I am using:

The first step was to create a function that builds the subgraphs of a given size and does the calculations needed.

random_network <- function(size, G){
  # score for a subgraph: weighted sum of node scores, normalized
  score_fun <- function(g){
    subsum <- sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
    subsum
  }

  genes.idx <- V(G)$name
  perm <- c()
  while(length(perm) < 1000){
    # grow a connected node set starting from a random seed node
    seed <- sample(genes.idx, 1)
    while(length(seed) < size){
      tmp.neigh <- V(G)[unlist(neighborhood(G, 1, seed))]$name
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if(length(tmp.neigh) > 0)
        seed <- c(seed, sample(tmp.neigh, 1))
      else break
    }
    # only keep samples that reached the requested size
    if(length(seed) == size)
      perm <- c(perm, score_fun(induced.subgraph(G, seed)))
  }
  perm
}

The second step was to apply the function to the actual graph:

 ### generate some example data
 library(igraph)
 my_graph <- erdos.renyi.game(10000, 0.0003)
 V(my_graph)$name <- 1:vcount(my_graph)
 V(my_graph)$weight <- rnorm(10000)
 V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)

 ### Run the code to get the subgraphs of different sizes and do calculations based on node attributes
 genesets.length <- seq(5, 500)
 genesets.length.null.dis <- list()
 for(k in 5:max(genesets.length)){
     genesets.length.null.dis[[as.character(k)]] <- random_network(size=k, G=my_graph)
 }
asked Oct 12 '15 by user2380782

2 Answers

Basically, your algorithm for sampling a subgraph could be described as: initialize the node set with a single randomly selected node, then iteratively add a neighbor of the current set until either there are no more neighbors or you have reached the desired subgraph size.

The expensive repeated operation here is determining the neighbors of the current set, which you do with the following:

tmp.neigh <- V(G)[unlist(neighborhood(G,1,seed))]$name
tmp.neigh <- setdiff(tmp.neigh, seed)

In short, you are looking at the neighbors of every node in your selected subset at each iteration, which gets more expensive as the subset grows. A more efficient approach is to store a logical vector marking the neighbors of the current set and to update it each time you add a new node, because then you only need to look up the neighbors of the newly added node.

josilber <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)

  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)

  perm <- replicate(num.rep, {
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node  # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      new.node <- sample(which(neigh & !used), 1)
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  })
  perm
}
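As an aside, here is a usage sketch (my addition, not part of the original answer) showing how the function above maps onto the question's loop over sizes, assuming the example graph my_graph built in the question:

# one list entry per subgraph size, each holding 1000 sampled scores
genesets.length.null.dis <- lapply(setNames(5:500, 5:500), function(k)
  josilber(size = k, num.rep = 1000, G = my_graph))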

For a single replicate, the rewritten function yields significant speedups over a single replicate of the code in the question:

  • For size=50, a single replicate takes 0.3 seconds for this code and 3.8 seconds for the posted code
  • For size=100, a single replicate takes 0.6 seconds for this code and 15.2 seconds for the posted code
  • For size=200, a single replicate takes 1.5 seconds for this code and 69.4 seconds for the posted code
  • For size=500, a single replicate for this code takes 2.7 seconds (so 1000 replicates should take about 45 minutes); I did not test a single replicate of the posted code.
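If you want to reproduce this kind of timing comparison yourself, a minimal sketch (my addition, not the answer's benchmark code, assuming the example my_graph and the josilber function defined above) would be:

# time a single replicate of the rewritten sampler at a few sizes
set.seed(1)
for (s in c(50, 100, 200)) {
  elapsed <- system.time(josilber(size = s, num.rep = 1, G = my_graph))["elapsed"]
  cat("size", s, ":", round(elapsed, 2), "seconds per replicate\n")
}
# random_network() from the question always draws 1000 scores internally, so its
# per-replicate time is roughly system.time(random_network(size = s, G = my_graph)) / 1000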

As mentioned in the other answers, parallelization could further improve the performance of doing 1000 replicates for a given graph size. The following uses the doParallel package to parallelize the repeated step (the adjustment is pretty much identical to the one made by @Chris in his answer):

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
josilber2 <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)

  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)

  perm <- foreach (i=1:num.rep, .combine='c') %dopar% {
    library(igraph)
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node  # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      new.node <- sample(which(neigh & !used), 1)
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  }
  perm
}

On my MacBook Air, josilber(100, 1000, my_graph) takes 670 seconds to run (this is the non-parallel version), while josilber2(100, 1000, my_graph) takes 239 seconds to run (this is the parallel version configured with 4 workers). For the size=100 case we have therefore gotten a 20x speedup from code improvements and an additional 3x speedup from parallelization, for a total speedup of about 60x.
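To make the arithmetic explicit (my own back-of-the-envelope numbers, derived from the per-replicate timings quoted above):

# size = 100: posted code ~15.2 s per replicate, so ~15,200 s for 1000 replicates
15.2 * 1000 / 670   # ~23x speedup from the algorithmic change alone
670 / 239           # ~2.8x additional speedup from 4-worker parallelization
15.2 * 1000 / 239   # ~64x overall, i.e. roughly the 60x quoted above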

answered by josliber

I don't have a complete answer but here are some things to consider to help speed it up (assuming there is not a much faster approach using a different method).

  1. Remove from your graph any nodes which are not part of a component as large as the size you are looking for. This will really depend on your network structure, but it looks like your nodes are genes, so there are likely many genes with very low degree, and you could get some speedups by removing them. Something like this code:

    cgraph <- clusters(G)
    tooSmall <- which(cgraph$csize < size)
    toKeep <- setdiff(1:length(V(G)), which(cgraph$membership %in% tooSmall))
    graph <- induced.subgraph(G, vids=toKeep)
    
  2. Consider running this in parallel to take advantage of multiple cores, for example using the parallel package and mclapply; a sketch combining both suggestions follows after this list.

    library(parallel)
    genesets.length<- seq(5, 500)
    names(genesets.length) <- genesets.length
    genesets.length.null.dis <- mclapply(genesets.length, mc.cores=7,
                                         function(length) {
                                           random_network(size=length, G=my_graph)
                                         })
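Putting the two suggestions together might look roughly like this (a sketch only, using a hypothetical wrapper random_network_filtered; it assumes random_network and my_graph as defined in the question):

    # hypothetical helper: drop components that are too small, then run the
    # question's sampler on the reduced graph, in parallel across sizes
    library(parallel)
    library(igraph)
    random_network_filtered <- function(size, G) {
      cgraph <- clusters(G)
      tooSmall <- which(cgraph$csize < size)
      toKeep <- setdiff(1:length(V(G)), which(cgraph$membership %in% tooSmall))
      random_network(size = size, G = induced.subgraph(G, vids = toKeep))
    }

    genesets.length <- seq(5, 500)
    names(genesets.length) <- genesets.length
    genesets.length.null.dis <- mclapply(genesets.length, mc.cores = 7,
                                         function(len) random_network_filtered(size = len, G = my_graph))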
    
answered by Stefan Avey