 

Most efficient algorithm for determining if X out of N inputs are true

This question was inspired by an answer I was working on yesterday.

Let's say we have N inputs that evaluate to either true or false, what is the most efficient way to determine if X of those inputs are true?

Caveats:

  1. The inputs are not in an array, so if you convert them to an array please account for any overhead costs.
  2. By "most efficient" I mean in terms of best average case (although I would love to see best and worst case stats too).

Here are two of the methods I came across yesterday.

1) Think of the variables as Boolean inputs for a circuit and reduce them using a K-map

At first I thought this would be the most efficient means because it follows circuit logic, but I've definitely had second thoughts. As the number of inputs increases, the number of comparisons goes up exponentially: the "X of N" condition needs one term for each of the C(N, X) ways of choosing X inputs.

2 inputs:
   1 of 2: if(1 OR 2)
   2 of 2: if(1 AND 2)

3 inputs:
   1 of 3: if(1 OR 2 OR 3)
   2 of 3: if((1 AND 2) OR (1 AND 3) OR (2 AND 3))
   3 of 3: if(1 AND 2 AND 3)

4 inputs:
   1 of 4: if(1 OR 2 OR 3 OR 4)
   2 of 4: if((1 AND 2) OR (1 AND 3) OR (1 AND 4) OR (2 AND 3) OR (2 AND 4) OR (3 AND 4))
   3 of 4: if((1 AND 2 AND 3) OR (1 AND 2 AND 4) OR (1 AND 3 AND 4) OR (2 AND 3 AND 4))
   4 of 4: if(1 AND 2 AND 3 AND 4)

... etc. ...

The best case scenario is fine (O(1)), but the worst case is far worse than...

2) A counter and sequential if statements

This performs in O(n) time always, which is OK, but I was hoping for a better best case.

counter = 0

if(input 1)
   counter++

if(input 2)
   counter++

if(input 3)
   counter++

... etc. ...

if(counter >= X)
   // true

What is a more efficient solution than either of these?

asked Jul 14 '11 by Swift


2 Answers

On the problem's complexity
Since an exact count is requested (as opposed to, say, asking if at least x inputs are on), the problem is very clearly O(n):

  • one needs to visit every single input, and,
  • the work for each input is independent of n (while the work for a given input may vary based on the particular value of the input, the amount of work [for this input] doesn't vary if the number of inputs is changed.)

We can certainly implement sub-optimal algorithms which, for example, would [unnecessarily] visit all other inputs as each input is being processed, making it an O(n^2) implementation, but that is of course silly.

This being asserted, the question is therefore probably about...
tricks which would make the implementation faster
It should be noted that while such tricks may exist, the complexity of the algorithm/problem remains stubbornly at O(n).

Trick 1: Better storage for inputs
Unfortunately, the question indicates that the inputs come from named variables and that the cost of any conversion of the inputs [for the sake of allowing faster counting] has to be taken into account when evaluating the overall performance of the algorithm. Although this ultimately depends on the underlying language, runtime, etc., the need to account for the conversion cost very likely condemns any algorithm based on alternate storage to be slower than solutions which keep the inputs as-is.
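
For concreteness, here is a minimal sketch of what such alternate storage might look like, in C, assuming GCC/Clang's __builtin_popcount and three illustrative named inputs (in1..in3). Note that the packing step itself costs one operation per input, which is precisely the conversion overhead the question says must be charged to the algorithm:

#include <stdbool.h>

bool x_of_3_popcount(bool in1, bool in2, bool in3, int X)
{
    unsigned mask = 0;
    mask |= (unsigned)in1 << 0;   /* conversion cost: one operation per input */
    mask |= (unsigned)in2 << 1;
    mask |= (unsigned)in3 << 2;

    /* exact-count reading; use >= X for "at least X" */
    return __builtin_popcount(mask) == X;
}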

Trick 2: short-circuit the evaluation
The idea is to return false as soon as (or shortly after) either of the following holds:

  • the running count of inputs that are on is bigger than X (or, if we are counting the number of inputs that are off, when that count exceeds (n - X))
  • the number of inputs left to test plus the running count of inputs that are on is less than X (or something similar in the case of counting the off inputs).

This trick is relatively straightforward, but the extra cost of computing the values needed for the early-exit tests may offset the gains made by [statistically] exiting early.
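
To make the two early-exit tests concrete, here is a minimal sketch in C, following the exact-count reading used above; the names (in1..in3) are illustrative for n = 3, and the per-input pattern simply repeats:

#include <stdbool.h>

bool exactly_x_of_3(bool in1, bool in2, bool in3, int X)
{
    int on = 0;         /* running count of on-inputs */
    int remaining = 3;  /* inputs not yet examined    */

    /* input 1 */
    remaining--;
    if (in1) on++;
    if (on > X) return false;              /* too many on-inputs already */
    if (on + remaining < X) return false;  /* X can no longer be reached */

    /* input 2 */
    remaining--;
    if (in2) on++;
    if (on > X) return false;
    if (on + remaining < X) return false;

    /* input 3 */
    if (in3) on++;
    return on == X;
}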

Trick 3: use reverse logic: count the number of inputs that are off rather than those that are on (or count both).
The cost/benefit of this approach depends both on the number of on-inputs to test for (the X of the question) and on the statistical prior we may have about the inputs (is the number of on-inputs at a given time relatively uniformly distributed, or do we tend to have only a few inputs on (or off)?).
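
A minimal sketch of the reverse count, under the same exact-count reading and again with illustrative names for n = 3: as soon as more than n - X inputs are off, fewer than X inputs can be on, so we can bail out early.

#include <stdbool.h>

bool exactly_x_of_3_off(bool in1, bool in2, bool in3, int X)
{
    const int n = 3;
    int off = 0;   /* running count of off-inputs */

    if (!in1 && ++off > n - X) return false;
    if (!in2 && ++off > n - X) return false;
    if (!in3 && ++off > n - X) return false;

    return off == n - X;   /* n - off inputs are on, i.e. exactly X */
}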

The solution proposed by Chris Acheson (below) provides a baseline for the use of both Trick 2 and Trick 3. Provided we can make a few assumptions about the distribution of the inputs' state, additional performance improvements to this baseline would be driven by such "priors": some quick heuristics done prior to the counting of the inputs per se would determine which state we should count (on or off or both), which limit we should test for, etc., and branch to the corresponding "version" of the algorithm.

Additional gains are also possible, if we know the individual probability of a given input to be on or off, as we'd then test for the most (or least) likely ones first, to quickly get to our "short circuit value".
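
For example (with made-up probabilities, purely for illustration): if one input is known to be on 90% of the time and another only 10% of the time, testing the 90% input first will, on average, reach the short-circuit value sooner when counting on-inputs, while the 10% input is the better first test when counting off-inputs.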

On the best-case/worst-case "complexity" of these optimized algorithms
Assuming that

  • the number of inputs that are on at a given time is uniformly distributed
  • all inputs have a 50% chance of being on at a given time
  • X is randomly selected between 1 and n

A combination of Tricks #2 and #3 could be O(X/2) on average (I need to do the math, but that seems right). However, I think it wiser to talk in terms of the number of operations relative to X and/or n, rather than misusing the O-notation...
Assuming that all operations roughly incur the same cost

  • Initialize a counter or variable
  • Test an input or a variable
  • Addition or subtraction
  • etc.

It is easier and also more exact to compute the total number of operations needed for a given algorithm to complete, and to use such counts for the various best/worst/average cases to help decide on specific algorithms.
To illustrate this, a naive implementation that would merely count all on-inputs systematically and only compare the counter at the end would be of O(n) complexity and complete in all cases in roughly 1 + 2*n + 1 operations. Such an algorithm may prove to be better overall than a fancier algorithm which, while being, say, O(X), O((X+n)/2) and O(n) in the best, average and worst cases, respectively, may well use X*3, (X+n)*1.5, and n*3 operations in these same cases.
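
For instance, with n = 10 and X = 5, the naive count costs 1 + 2*10 + 1 = 22 operations in every case, whereas the hypothetical fancier algorithm above would cost 15 operations in its best case, 22.5 on average, and 30 in its worst case; which one wins therefore depends entirely on the mix of cases actually encountered.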

answered Nov 15 '22 by mjv


This version will be efficient for values of X close to either zero or N:

true_counter = 0
false_counter = 0
max_false = N - X

if(input 1)
   true_counter++
   if(true_counter >= X)
      return true
else
   false_counter++
   if(false_counter > max_false)
      return false

// and so on
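
(Note that by the time all N inputs have been examined, one of the two early returns is guaranteed to have fired: the two counters always sum to the number of inputs seen so far, so if true_counter never reaches X, then false_counter must exceed N - X by the last input.)
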
answered Nov 15 '22 by Chris Acheson