Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

subcomponent(mode = "in") for multiple source vertices

Question

In the igraph R package, is there an efficient implementation of subcomponent() and/or BFS that can handle multiple source vertices?

Motivation

The drake R package models a user's workflow as a DAG of interdependent objects and files. The DAG should only contain the user's targets and their upstream dependencies, so drake uses igraph::subcomponent() to eliminate superfluous vertices. This approach is inefficient because the v argument must be a single vertex, so drake ends up conducting a new BFS for every single target the user wants to build.

EDIT: 2019-01-10

drake now uses a different approach that ultimately relies on sequential calls to adjacent_vertices(). The approach is clunky, but the speed improvement is actually quite nice. Still holding out for something more elegant and sophisticated.

like image 276
landau Avatar asked Jan 07 '19 03:01

landau


1 Answers

I think you can do this using the distances() function which generates a matrix of distances (no of edges) between nodes. This seems to do the search once and is a good deal faster than iterating through each vertex.

Example code:

library(igraph)
library(microbenchmark)

# generate some random testing data
set.seed(1234)
g <- erdos.renyi.game(50, .01)

# Here we make a function that iterates 
# across the vector of IDs applying the function
# and returns a list where each element is the
# ids of the subcomponents
sc_apply <- function(g) {
  vs <- V(g)
  res <- sapply(vs, function(v){as.numeric( # to facilitate comparison
    subcomponent(g, v, mode = "in")
    )})
  res
}

# Try it for testing
t1 <- sc_apply(g)

# Here we get the matrix of node distances. Infinite distance
# implies a seperate component. We iterate through rows of
# matrix to extract the set of nodes that are connected 
sc_distmat <- function(g) {
  dmat <- distances(g, mode = "in")
  res <- apply(dmat, 1, function(row){which(is.finite(row))})
  res
}

# extract for testing
t2 <- sc_distmat(g)

# check that equal (we need to sort the 
# subcomponent list elements first to facilitate comparison)
all.equal(lapply(t1, sort), t2)
#> [1] TRUE

The results are identical -- although worth noting that if your graph is one giant component than apply will return a matrix to you instead of a list so you will need to do your comparisons in a slightly different way.

Ok now lets see how much/whether this is faster:

# generate graphs of different sizes (not too big because my
# laptop is borderline antique!)
set.seed(42)
small_g <- erdos.renyi.game(10, .2)
mid_g <- erdos.renyi.game(50, .1)
big_g <- erdos.renyi.game(100, .1)

# check speed improvement
microbenchmark(sc_apply(small_g), sc_distmat(small_g))
#> Unit: microseconds
#>                 expr      min        lq      mean   median        uq
#>    sc_apply(small_g) 2181.465 2243.4895 2734.7132 2313.005 2777.7135
#>  sc_distmat(small_g)  451.333  471.8565  840.4742  521.865  598.0845
#>        max neval cld
#>   9152.262   100   b
#>  27139.262   100  a
microbenchmark(sc_apply(mid_g), sc_distmat(mid_g))
#> Unit: microseconds
#>               expr       min        lq       mean    median         uq
#>    sc_apply(mid_g) 11006.113 11327.794 13590.9536 12919.492 15397.2510
#>  sc_distmat(mid_g)   757.752   795.308   949.2551   841.834   965.4545
#>        max neval cld
#>  27068.296   100   b
#>   2061.824   100  a
microbenchmark(sc_apply(big_g), sc_distmat(big_g))
#> Unit: milliseconds
#>               expr      min        lq      mean    median        uq
#>    sc_apply(big_g) 23.11678 26.696373 29.940675 29.191045 33.012796
#>  sc_distmat(big_g)  1.67531  1.727873  2.156307  1.855994  2.244872
#>        max neval cld
#>  47.081647   100   b
#>   7.576123   100  a

As you can see the distances() approach is faster, and increasingly faster as the size of the graph grows.

Created on 2019-01-10 by the reprex package (v0.2.1)

like image 173
gfgm Avatar answered Sep 25 '22 01:09

gfgm