Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why aren't nodes of the last index over-represented in random networks generated with igraph?

Tags:

c

r

igraph

I am using the R interface of igraph to generate a random directed network (Erdös-Rényi) with a constant number of nodes, n, and edges, m, using the function sample_gnm.

To make sure I understand which is the algorithm used, I checked the C source code although I have no experience in C. As far as I understand the C code, there is an if statement that should lead to an over-representation of nodes with index n receiving directed edges.

This is the real code: https://github.com/igraph/igraph/blob/7d4be976481356fa673772e6e7c30b637ea8dd52/src/games.c#L734-L736 and this is how I understand the C code in pseudo-code:

# What is the maximum number of edges a network with n nodes could have
maxEdges := n*(n-1)

s := uniformly sample m integers from [1, maxEdges] without replacement

for (i = 1; i = m; i++) {

  # Get IDs for nodes A and B with equal probability over n
  nodeA := floor(s[i] / (n)) + 1
  nodeB := s - ((nodeA - 1) * n)

  # Since we do not allow loops, if nodeA = nodeB, assign n to nodeB
  if (nodeA = nodeB) {
    nodeB := n
  }

}

However, I also run a simulation in R to make sure that this is really the case:

testFun = function(n,m) {

  # Generate the network
  g = sample_gnm(n, m, directed = TRUE, loops = FALSE)

  # Find the "to" node IDs
  toEdgename = ends(g, E(g))[, 2]

  return(toEdgename)

}

# Create 1000 random networks and get the "to" node name for each edge
spam = replicate(1000, testFun(100, 9000))
# Plot the histogram
hist(sapply(1:ncol(spam), 
            # Count the percent of times the index 100 appeared per simulation
            function(ii) sum(spam[, ii] == 100) / 9000), 
     100)

To my surprise, it does not lead to an observable bias. This must mean that I do not comprehend what the C code is doing. Could anyone please help me understand why does this piece of C code not lead to over-representation of the n index?

like image 969
vkehayas Avatar asked Oct 16 '22 13:10

vkehayas


1 Answers

The reason is that nodeB in your pseudo-code can never be n (or, in the C code, it can never be no_of_nodes - 1. (However, nodeA can be n!)

In fact, the maximum value for nodeB is given by maxEdges (mod n−1), and values in mod n−1 are in the range [0, n−1[; note that the upper bound is exclusive.

like image 82
Konrad Rudolph Avatar answered Oct 21 '22 02:10

Konrad Rudolph