Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Google interview algorithm puzzle: expected size of the largest connected component in a random simple graph (N nodes, N edges)?

Given a random simple graph of N nodes, N edges, and a uniform edge probability p, what is the expected size of the largest connected component?

  • The only initial input parameter supplied is N, which represents the number of nodes in the graph
  • The graph is simple, meaning that it is not allowed to contain any self-loops
  • There are exactly N node pairings, ie. the edgelist is of the form {(1,a), (2,b), (3,c), ..., (N,[next available letter of the alphabet])}. The conditions are that a!=1, b!=2, c!=3 etc. but other than that, a,b,c,... can take any value from 1 to N inclusive
  • The node n forming an edge with some other node can be described by a uniform probability distribution
  • Foreach such valid graph constructed of N nodes and N edges, find the size of the largest connected component S
  • By connected component, this is defined as the largest cluster/group of nodes where each node in the connected component can reach to/is reachable from every other node in the same connected component

The question is: among all such valid graphs constructed, what is the expected value of S?

I would like to know of an efficient way to think about/analyze/solve this problem that can deal with N ranging from 2 to 50 inclusive. Some code to illustrate would be helpful to my understanding.


I wrote the following source codes to naively bruteforce the possibilities, hoping to find a pattern from there. Managed to get up to N = 9 with these:

  • Python
  • C/C++

Both attempt to generate all possible permutations according to the criteria specified in the problem, and then calculate S for each graph generated via BFS.

Output so far

As for the format, eg. for the line "N=4: {2:3,4:78} 106/27", this means that there are 3 graphs with the size of the largest component = 2, and 78 graphs with the size of the largest component = 4, so the final answer = (3/(78+3))*2 + (78/(78+3))*4 = 106/27.

N=2: {2:1} 2/1
N=3: {3:8} 3/1
N=4: {2:3,4:78} 106/27
N=5: {3:80,5:944} 155/32
N=6: {2:15,3:640,4:1170,6:13800} 17886/3125
N=7: {3:840,4:21840,5:19824,7:237432} 38563/5832
N=8: {2:105,3:17920,4:229320,5:422912,6:386400,8:4708144} 6152766/823543
N=9: {3:153440,4:786240,5:9634464,6:9273600,7:8547552,9:105822432} 17494593/2097152

edit: Just discovered this OEIS sequence #A000435 gives the answers (formula/listing up to N=18) for the number of graphs with N nodes and largest size component = N. This is exactly coincidental with my bruteforce output, eg. for N=9, there are 105822432 graphs with the largest size of connected component = 9. Not sure if this helps - one of the formulas given there to derive this for larger N is a(n) = (n-1)! * Sum (n^k/k!, k=0..n-2) Python implementation


Example for N = 4:

There are a total of 81 possible edge assignments for 4 nodes (1-based indexed from 1 to N).

The format for the sample output below is: ((1, 2), (2, 1), (3, 1), (4, 1)): 4 means that there are edges between nodes 1 and 2, nodes 2 and 1, nodes 3 and 1, and nodes 4 and 1. Assume the edges are undirected/bidirectional. For such a graph, there is only a single connected component of all 4 nodes {1,2,3,4}, so the size of the largest (only) connected component is 4.

After generating this list, expected value of S can be computed via (fraction of all 81 instances whereS== 4) * 4 + (fraction of all 81 instances whereS== 2) * 2 =106/27 - since the only values S takes are 2,4.

{((1, 2), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 3)): 2,
 ((1, 2), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 1), (4, 2)): 2,
 ((1, 3), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 2), (4, 1)): 2,
 ((1, 4), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 3)): 4}
like image 204
Adeline Ho Avatar asked Feb 15 '15 15:02

Adeline Ho


People also ask

What is the number of simple graph possible with n vertices and e edges?

The maximum number of edges possible in a single graph with 'n' vertices is nC2 where nC2 = n(n – 1)/2. The number of simple graphs possible with 'n' vertices = 2nc2 = 2n(n-1)/2.

How many edges does a graph have with N nodes?

If you have N nodes, there are N - 1 directed edges than can lead from it (going to every other node). Therefore, the maximum number of edges is N * (N - 1) .


1 Answers

Before looking at the probabilities, let us look at a similar problem in combinatorics, i.e. given the rules of the graphs and N nodes, how many graphs can be attained when the largest connected component is S.

Notation

For the rest on this answer I will use the following notation:

C(S=s|N) notes the total count of possible graphs with N nodes and N edges as defined in the question with the largest connects component is S=s.

Examples:

  1. C(S=2|2) = 1
  2. C(S=2|3) = 0
  3. C(S=3|3) = 8

T(s) notes the number of graphs with s nodes and s edges as defined in the question where all nodes are connected. T(s) = C(S=s|s).

Examples

  1. T(2) = 1
  2. T(3) = 8

Key observations

Some key observation to lay out the stages of the solution:

  1. C(S=N-1|N) = 0 since the last node must be connected to itself. This is however not a legal graph and therefore the count for this case is 0
  2. C(S=2|N) is different than 0 only for even values of N. If N is odd, there has to be a node connected to itself and again, this is illegal.
  3. C(S=2|N) = (N choose 2) * C(S=2|N-2). This also correlates with 2. Since this recursion will eventually reach C(S=2|1) =0 and the entire expression will be 0 for odd N values.
  4. Expanding 3. to a slightly more general case we can get C(S=2|N) = (N choose 2) * T(2) * C(S=2|N-2) since T(2) = 1 this holds.
  5. Expanding 4. for a general case we get C(S=s|N) = (N choose s) * T(s) * [C(S=2|N-s) + C(S=3|N-s) + ... + C(S=s|N-s)]

Note that these expressions include double counts, we will get to that later.

This means that we found a recursive pattern to compute the number of graph (up to double counts), all we need is a method to compute T(s).

Computing T(s)

For a N nodes N edges graph as defined in the question, T(N) can be computed via three components:

  1. (N choose N-1) * T(N-1) * (N-1): The fully connected structure has a N-1 connected graph and the new node is connected to either of the N-1 nodes creating a N connected graph. This section contains double counts.
  2. Number of graphs creating a cycle the size of N which is (N-1)!
  3. correcting double counts

Therefore we can compute T(s) as follows (the explanation for the double counts in this case is too complex for me to write down an explanation): enter image description here

We note that for T(2) and T(3) we do not have double counts.

Examples

  1. T(3) = (3 choose 2)*T(2)*2 + 2! = 3*1*2+2=8
  2. T(4) = (4 choose 3)*T(3)*3 + 3! - (4 choose 2)*T(2)*2^2 = 4×8×3 +6 -6*1*4=78
  3. T(5) = (5 choose 4)*T(4)*4 + 4! - (5 choose 3)*T(3)*3^2 + (5 choose 2)*T(2)*2^3 = 5*78*4 +24 -10*8*9 +10*1*2^3 = 944

The only thing left is to find a way to overcome double counts in key observation 5. This is done using the link above. After breaking N to a different groups, we correct the double count by dividing the result by a factor to complete it.

Examples

  1. In case the have N=6 nodes and we compute
C(S=2|6) = (6 choose 2)*T(2)*C(S=2|4) = (6 choose 2)*T(2)*(4 choose 2)*T(2)

We can see that we divided 6 into three groups of 2 therefore we need to correct the double count by applying a factor of 3!:

C(S=2|6) = (6 choose 2)*T(2)*(4 choose 2)*T(2)/(3!) = 15
  1. In case the have N=9 nodes and we compute
C(S=3|9) = (9 choose 3)*T(3)*[C(S=2|6) + C(S=3|6)]

We can see that the first component of the brackets, i.e. C(S=2|6) was computed in example 1, and a factor of 3! was divided from the result because 6 was divided into three groups of 2. for the second component of the brackets we get:

C(S=3|9) = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*C(S=3|3)] = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*T(3)]

We see that 9 was divided into three groups of 3 and therefore we add a factor of 3! to counter the double counts:

C(S=3|9) = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*T(3)/(3!)] = 153440

Everything is now set up to compute C(S=s|N) for all s valuees between 2 and N. Then the probability can achieved by dividing by the sum of all graphs:

P(S=s|N) = C(S=s|N) / sum_s{C(S=s|N)}

And the expected value can be found via:

E[S|N] = sum_s{s * P(S=s|N)}

Implementation

Below I will show a python implementation for the method derived above. I will compute the values including the double counts to keep the recursive property, thus allowing me to compute all the graph counts in O(N^2) and I will keep record of the splitting to know how to correct the double counts later on. The code for the counts with comments is as follows:

import numpy as np
from math import comb
from collections import Counter

class ExpectedSize:
    def __init__(self, N):
        self.N = N
        self.counts           = None
        self.group_dict       = None
        self.corrected_counts = None
        self.expected_size    = None

    def get_T(self):
        """
        :param N: Number of nodes and edges
        :return: A vector where the ith location contains the number of graphs with i nodes and i edges where all the nodes
         are connected
        """
        T = np.zeros(self.N + 1)
        for ii in range(len(T)):
            if ii < 2:
                T[ii] = 0
            elif ii == 2:
                T[ii] = 1
            else:
                n_vec      = np.arange(1, ii-2)
                sign_vec   = (-1) ** n_vec
                choose_vec = [comb(ii, ii-1-jj) for jj in n_vec]
                t_vec      = T[ii-2:1:-1]
                power_vec  = (ii -1 - n_vec) ** (n_vec + 1)
                T[ii] = ii * T[ii - 1] * (ii - 1) + np.math.factorial(ii - 1) + np.sum(sign_vec * choose_vec * t_vec * power_vec) # X choose (X-1) = X
        return T

    def get_counts(self):
        # Init
        T_s = self.get_T()
        # counts = np.zeros((self.N + 1, self.N + 1))
        # counts[:, 0] = 1
        # counts[2, 2] = 1
        counts = {(2,2): [1]}
        for ii in range(self.N + 1):
            counts[(ii, 0)] = [1]
        group_dict = {(2,2): [[2]]}
        for NN in range(3, self.N + 1):
            for ss in range(2, NN+1):
                # Computing C(S = ss | ii) without the double-counting correction
                init_term = comb(NN, ss) * T_s[ss]
                # creating new split list
                split_list = []
                count_list = []
                for sigma in range(2, ss+1):
                    key = (sigma, NN-ss)
                    if NN-ss == 0 and sigma > 2:
                        break
                    # Getting sum components
                    temp_components = counts.get((sigma, NN-ss), [0])
                    # Updating split counts
                    group_count = group_dict.get(key, [[]])
                    # appending split list to the total list if needed
                    for gg in group_count:
                        if temp_components[0] > 0:
                            split_list.append(gg + [ss])
                            group_dict[(ss, NN)] = split_list
                            count_list += [temp_components[0]]
                counts[(ss, NN)] = [init_term * cc for cc in count_list] if len(count_list) > 0 else [0]
        self.counts     = counts
        self.group_dict = group_dict

    def correct_double_counts(self):
        counts           = self.counts
        group_dict       = self.group_dict
        corrected_counts = {}
        for key in counts:
            group_count    = group_dict.get(key, [[]])
            sum_components = counts[key]
            corrected_temp = []
            for ii, sum_component in enumerate(sum_components):
                counter_dict = Counter(group_count[ii])
                corrected_temp += [sum_component]
                for split_key in counter_dict:
                    corrected_temp[-1] = corrected_temp[-1]/ np.math.factorial(counter_dict[split_key])
            corrected_counts[key] = sum(corrected_temp)
        self.corrected_counts = corrected_counts

    def get_expected_size(self):
        cumsum   = 0
        expected = 0
        for ii in range(1, self.N+1):
            key = (ii, self.N)
            cumsum += self.corrected_counts[key]
            expected += ii * self.corrected_counts[key]
        self.expected_size = expected / cumsum

Running the following code for N=9 results in:

exp_size = ExpectedSize(9)
exp_size.get_counts()
exp_size.correct_double_counts()
print(exp_size.corrected_counts)
(2, 2): 1.0,
(2, 3): 0, (3, 3): 8.0,
(2, 4): 3.0, (3, 4): 0, (4, 4): 78.0,
(2, 5): 0, (3, 5): 80.0, (4, 5): 0, (5, 5): 944.0,
(2, 6): 15.0, (3, 6): 640.0, (4, 6): 1170.0, (5, 6): 0, (6, 6): 13800.0,
(2, 7): 0, (3, 7): 840.0, (4, 7): 21840.0, (5, 7): 19824.0, (6, 7): 0, (7, 7): 237432.0,
(2, 8): 105.0, (3, 8): 17920.0, (4, 8): 229320.0, (5, 8): 422912.0, (6, 8): 386400.0, (7, 8): 0, (8, 8): 4708144.0,
(2, 9): 0, (3, 9): 153440.0, (4, 9): 786240.0, (5, 9): 9634464.0, (6, 9): 9273600.0, (7, 9): 8547552.0, (8, 9): 0, (9, 9): 105822432.0}

We can see that the counts match the brute force computations of OP (the keys are (S,N).

Counts to expected value

Now the expectation computation can be done with ease using the get_expected_size function

Examples

  1. N=2 --> E[S|N] = 2
  2. N=3 --> E[S|N] = 3
  3. N=4 --> E[S|N] = 3.925925925925926 ; 126 / 27 = 3.925925926
  4. N=5 --> E[S|N] = 4.84375 ; 155 / 32 = 4.84375
  5. N=6 --> E[S|N] = 5.72352 ; 17886 / 3125 = 5.72352
  6. N=7 --> E[S|N] = 6.612311385459534 ; 38563 / 5832 = 6.612311385
  7. N=8 --> E[S|N] = 7.471092584115219 ; 6152766 / 823543 = 7.471092584
  8. N=9 --> E[S|N] = 8.342072010040283 ; 17494593 / 2097152 = 8.34207201
  9. N=10 --> E[S|N] = 9.189007166835722
  10. N=11 --> E[S|N] = 10.047272752
  11. N=12 --> E[S|N] = 10.891595346262637

And we can see that the overall results are accurate w.r.t the brute force results up to precision point of Python. I this is a sufficient solution.

like image 71
Tomer Geva Avatar answered Nov 13 '22 19:11

Tomer Geva