Calculating the probability of system failure in a distributed network

Tags:

I am trying to build a mathematical model of the availability of a file in a distributed file-system. I posted this question at MathOverflow but this might as well be classified as a CS-question so I give it a shot here as well.

The system works like this: a node stores a file (encoed using erasure codes) at r*b remotes nodes, where r is the replication-factor and b is an integer constant. Erasure-coded files have the property that the file can be restored iff at least b of the remote nodes are available and return its part of the file.

The simplest approach to this is to assume that all remote nodes are independent of each other and have the same availability p. With these assumptions the availability of a file follows the Binomial distribution, i.e. $Binomial distribution$

Unfortunately these two assumptions can introduce a non-neligible error, as shown by this paper: http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf.

One way to overcome the assumption that all nodes have the same availability is to calculate the probability of each possible combination of availaible/non-available node and take the sum of all these outcomes (which is sort of what they suggest in the paper above, just more formally than what I just described). You can see this approach as a binary tree with depth r*b and each leave is one possible combination of available/not-available nodes. The file's availability is the same thing as the probablity that you arrive at a leave with >=b available nodes. This approach is more correct but has a computational cost of Ordo . Also, it doesn't deal with the assumption of node independence.

Do you guys have any ideas of a good approximation which introduces less error than the binomial distribution-aproximation but with better computational cost than ?

You can assume that the availability-data of each node is a set of tuples consisting of (measurement-date, node measuring, node being measured, succes/failure-bit). With this data you could for example calculate the correlation of the availability between the nodes and the availability variance.

927

asked Jun 22 '10 11:06

Yrlec

2 Answers

For large r and b you can use a method called Monte-Carlo integration, see e.g. Monte Carlo integration on Wikipedia (and/or chapter 3.1.2 of SICP) to compute the sum. For small r and b and significantly different node-failure probabilities p[i] the exact method is superior. The exact definition of small and large will depend on a couple of factors and is best tried out experimentally.

Specific sample code: This is a very basic sample code (in Python) to demonstrate how such a procedure could work:

Click to copy

def montecarlo(p, rb, N):
    """Corresponds to the binomial coefficient formula."""
    import random
    succ = 0

    # Run N samples
    for i in xrange(N):
        # Generate a single test case
        alivenum = 0
        for j in xrange(rb):
            if random.random()<p: alivenum += 1
        # If the test case succeeds, increase succ
        if alivenum >= b: succ += 1
    # The final result is the number of successful cases/number of total cases
    # (I.e., a probability between 0 and 1)
    return float(succ)/N

The function corresponds to the binomial test case and runs N tests, checking if b nodes out of r*b nodes are alive with a probability of failure of p. A few experiments will convince you that you need values of N in the range of thousands of samples before you can get any reasonable results, but in principle the complexity is O(N*r*b). The accuracy of the result scales as sqrt(N), i.e., to increase accuracy by a factor of two you need to increase N by a factor of four. For sufficiently large r*b this method will be clearly superior.

Extension of the approximation: You obviously need to design the test case such, that it respects all the properties of the system. You have suggested a couple of extensions, some of which can be easily implemented while others can not. Let me give you a couple of suggestions:

1) In the case of distinct but uncorrelated p[i], the changes of the above code are minimal: In the function head you pass an array instead of a single float p and you replace the line if random.random()<p: alivenum += 1 by

Click to copy

if random.random()<p[j]: alivenum += 1

2) In the case of correlated p[i] you need additional information about the system. The situation I was referring to in my comment could be a network like this:

Click to copy

A--B--C
   |  |
   D  E
   |
   F--G--H
      |
      J

In this case A might be the "root node" and a failure of node D could imply the automatic failure with 100% probability of nodes F, G, H and J; while a failure of node F would automatically bring down G, H and J etc. At least this was the case I was referring to in my comment (which is a plausible interpretation since you talk about a tree structure of probabilities in the original question). In such a situation you would need modify the code that p refers to a tree structure and for j in ... traverses the tree, skipping the lower branches from the current node as soon as a test fails. The resulting test is still whether alivenum >= b as before, of course.

3) This approach will fail if the network is a cyclic graph that cannot be represented by a tree structure. In such a case you need to first create graph nodes that are either dead or alive and then run a routing algorithm on the graph to count the number of unique, reachable nodes. This won't increase the time-complexity of the algorithm, but obviously the code complexity.

4) Time dependence is a non-trivial, but possible modification if you know the m.t.b.f/r (mean-times-between-failures/repairs) since this can give you the probabilities p of either the tree-structure or the uncorrelated linear p[i] by a sum of exponentials. You will then have to run the MC-procedure at different times with the corresponding results for p.

5) If you merely have the log files (as hinted in your last paragraph) it will require a substantial modification of the approach which is beyond what I can do on this board. The log-files would need to be sufficiently thorough to allow to reconstruct a model for the network graph (and thus the graph of p) as well as the individual values of all nodes of p. Otherwise, accuracy would be unreliable. These log-files would also need to be substantially longer than the time-scales of failures and repairs, an assumptions which may not be realistic in real-life networks.

121

answered Sep 24 '22 07:09

user8472

Assuming each node has a constant, known and independent availability rate, A divide and conquer approach come to mind.

Say you have N nodes.

Split them into two sets of N/2 nodes.
For each side, compute the probability that any number of nodes ([0,N/2]) are down.
Multiply and sum these as needed to find the probability that any number ([0,N]) of the full set is down.

Step 2 can be done recursively and at the top level you can sum as need to find how often more than some number are down.

I don't know the complexity of this but if I has to guess, I'd say at or below O(n^2 log n)

The mechanics of this can be illustrated on a terminal case. Say we have 5 nodes with up times $p_1...p_5$ . We can split this into segments A with $p_1...p_2$ and B with $p_2...p_6$ . We then can process these to find the "N nodes up" times for each segment:

For A:

$a_2=p_1p_2$

$a_1=p_1(1-p_2)+(1-p_1)p_2$

$a_0=(1-p_1)(1-p_2)$

For B:

$b_2=p_3p_4p_5$

$b_2..b_0$

The final results for this stage can be found by multiplying each of the a's with each of the b's and summing appropriately.

Click to copy

v[0] = a[0]*b[0]
v[1] = a[1]*b[0] + a[0]*b[1]
v[2] = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
v[3] =             a[2]*b[1] + a[1]*b[2] + a[0]*b[3]
v[4] =                         a[2]*b[2] + a[1]*b[3]
v[5] =                                     a[2]*b[3]

answered Sep 24 '22 07:09

BCS

Related questions
                            
                                What is Tuple? And tuple vs. List vs. Vector?
                            
                                What is regularity?
                            
                                What's the difference between a boolean literal and a boolean value?
                            
                                How do you define an anonymous function in lambda calculus terms (or how can I say some language supports anonymous functions)?
                            
                                Fast way to find a intersection between two sets of numbers, one defined by a bitwise condition and another by an arithmetic condition
                            
                                E_ACCESSDENIED when creating a host-only interface on virtualbox via vagrant [closed]
                            
                                What is the difference between "hill climbing" and "branch-and-bound" search algorithms?
                            
                                Closures and Context Free Grammars
                            
                                Proving that P <= NP
                            
                                "Introduction to Computer Science and Programming" for a beginner [closed]
                            
                                What would you call the time complexity of an algorithm of this sort?
                            
                                Hashes: Tables, Lists and Maps, Oh My?
                            
                                What class of language can Perl regular expressions be used against?
                            
                                Data structure for quick match in a card game
                            
                                Partition Problems Brute Force Algorithm
                            
                                How does an X-bit computer work with 2X-bit numbers?
                            
                                Fault/error/failure in this example [closed]
                            
                                Memory-efficient way of computing the median of a large data set? [closed]
                            
                                Algorithm for optimally choosing actions to perform a task

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Calculating the probability of system failure in a distributed network

Tags:

computer-science

time-complexity

distributed

high-availability

binomial-cdf

Yrlec

People also ask

2 Answers

user8472

BCS

Recent Activity

Donate For Us