 

Big O time complexity of a while loop with a random Object

import java.util.Random;

public class Question2 {
    // I believe the running time of this function is O(N)
    public static boolean isThere(int[] array, int num, int index) {
        boolean isItThere = false;          // O(1)
        for (int i = 0; i <= index; i++) {  // runs index + 1 times
            if (array[i] == num) {          // O(1)
                isItThere = true;           // O(1)
            }
        }
        return isItThere;
    }

    public static int[] firstAlgo(int N) {
        Random random = new Random();       // O(1) initialization
        int[] arr = new int[N];
        for (int i = 0; i < N; i++) {
            int temp = random.nextInt(N + 1);  // nextInt is O(1)
            while (isThere(arr, temp, i)) {
                temp = random.nextInt(N + 1);
            }
            arr[i] = temp;
        }
        return arr;
    }
}

I want to figure out the time complexity of the while loop. I know the running time of the isThere function is O(N), and so is the main for loop in firstAlgo.

asked Jan 25 '23 by jose_burg

1 Answer

The short version:

  • The expected runtime is Θ(N^2 log N).
  • I have the math to back this up, as well as empirical data.

Here's a plot comparing the empirical amount of work done to the best-fit approximation I got for a function of the form N^2 log N, which was (N^2 ln N) / 1.06:

[Plot: comparison of the empirical work done to the predicted N^2 log N behavior, showing an almost exact match.]

Curious? Keep reading. :-)

Let's take a step back from the code here and see what the actual logic is doing. The code works as follows: for each array index 0, 1, 2, ..., N-1, the code repeatedly generates random numbers between 0 and N (inclusive) until it finds one that hasn't been seen before. It then writes that number into the current array slot and moves on to the next one.

A few observations that we'll need in this analysis. Imagine that we're about to enter the kth iteration of the loop in firstAlgo. What can we say about the elements of the first k slots of the array? Well, we know the following:

  1. The elements at positions 0, 1, 2, 3, ..., k-1 are all different from one another. The reason for this is that each loop iteration only stops once it's found something that doesn't otherwise appear in the array up to that point.
  2. None of those values is equal to 0. The array starts out filled with 0s, and since isThere scans up through the current index (whose slot still holds the default 0), any generated 0 is always rejected.
  3. As a consequence of (1) and (2), the elements in slots 0, 1, 2, ..., k-1, and k are all different: slots 0 through k-1 hold distinct nonzero values, and slot k still holds 0.

Now, we can get down to some math. Let's look at iteration k of the loop in firstAlgo. Each time we generate a random number, we look at (k+1) array slots to make sure the number doesn't appear there. (I'm going to use this quantity, by the way, as a proxy for the total work done, since most of the energy will be spent scanning that array.) So then we need to ask - on expectation, how many numbers are we going to generate before we find a unique one?

Fact (3) from the above list is helpful here. It says that on iteration k, the first k+1 slots in the array are different from one another, and we need to generate a number different from all of those. There are N+1 options of random numbers we can pick, so there are (N+1) - (k+1) = N - k options for numbers we could pick that won't have been used. This means that the probability that we pick a number that hasn't yet come up is (N - k) / (N + 1).

A quick check to make sure this formula is right: when k = 0, we are okay generating any random number other than 0. There are N+1 choices, so the probability we do this is N / (N+1). That matches our formula from above. When k = N-1, then all previous array elements are different and there's only one number we can pick that will work. That means we have a success probability of 1 / (N+1), again matching our formula. Cool!

There's a useful fact from probability theory that if you keep flipping a biased coin that has probability p of coming up heads, the expected number of flips before you flip heads is 1 / p. (More formally, that's the mean of a geometric random variable with success probability p, and you can prove this using the formal definition of expected values and some Taylor series.) This means that on the kth iteration of that loop, the expected number of random numbers we'll need to generate before we get a unique one is (N + 1) / (N - k).
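That 1/p fact is easy to sanity-check with a short Monte Carlo simulation. A sketch (the class and method names here are my own):

```java
import java.util.Random;

// Monte Carlo check that a biased coin with heads-probability p
// takes 1/p flips on average before the first heads appears.
public class GeometricCheck {
    public static double averageFlips(double p, int trials, long seed) {
        Random random = new Random(seed);
        long totalFlips = 0;
        for (int t = 0; t < trials; t++) {
            int flips = 0;
            do {
                flips++;
            } while (random.nextDouble() >= p);  // tails: flip again
            totalFlips += flips;
        }
        return (double) totalFlips / trials;
    }

    public static void main(String[] args) {
        // With p = 0.25, we expect about 1 / 0.25 = 4 flips on average.
        System.out.println(averageFlips(0.25, 1_000_000, 42L));
    }
}
```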

Overall, this means that the expected amount of work done on iteration k of the loop in firstAlgo is given by (N + 1)(k + 1) / (N - k). That's because

  • there are, on expectation, (N + 1)/(N - k) numbers generated, and
  • each generated number requires (k + 1) array slots to be checked.

We can then get our total amount of work done by summing this up from k = 0 to N - 1. That gives us

      0+1          1+1          2+1                 N
(N+1)----- + (N+1)----- + (N+1)----- + ... + (N+1)-----
      N-0          N-1          N-2                 1   

Now, "all" we have to do is simplify this summation to see what we get back. :-)

Let's begin by factoring out the common (N + 1) term here, giving back

       /  1     2     3           N  \
 (N+1)|  --- + --- + --- + ... + --- |
       \  N    N-1   N-2          1  /

Ignoring that (N + 1) term, we're left with the task of simplifying the sum

 1     2     3           N
--- + --- + --- + ... + ---
 N    N-1   N-2          1

Now, how do we evaluate this sum? Here's a helpful fact. The sum

 1     1     1           1
--- + --- + --- + ... + ---
 N    N-1   N-2          1

gives back the Nth harmonic number (denoted H_N) and is Θ(log N). More than just being Θ(log N), it's very, very close to ln N.
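To see how tight that approximation is, here's a quick numeric check (naming is mine):

```java
// Compare the Nth harmonic number H_N against ln N.
// The difference converges to the Euler-Mascheroni constant, about 0.5772.
public class HarmonicCheck {
    public static double harmonic(int n) {
        double sum = 0.0;
        for (int k = 1; k <= n; k++) {
            sum += 1.0 / k;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println(harmonic(n) - Math.log(n));  // close to 0.5772
    }
}
```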

With that in mind, we can do this rewrite:

      1     2     3           N
     --- + --- + --- + ... + ---
      N    N-1   N-2          1

      1     1     1           1
=    --- + --- + --- + ... + ---
      N    N-1   N-2          1

            1     1           1    
+          --- + --- + ... + ---
           N-1   N-2          1

                  1           1    
+                --- + ... + ---
                 N-2          1

+                      ...


                              1    
+                            ---
                              1

The basic idea here is to treat (k + 1) / N as (k + 1) copies of the fraction 1 / N, and then to regroup them into separate rows like this.

Once we've done this, notice that the top row is the Nth harmonic number H_N, the row below that is the (N-1)st harmonic number H_(N-1), the row below that is the (N-2)nd harmonic number H_(N-2), etc. So this means that our fraction sum works out to

H_1 + H_2 + H_3 + ... + H_N

= Θ(log 1 + log 2 + log 3 + ... + log N)

= Θ(log N!) (properties of logs)

= Θ(N log N) (Stirling's approximation).
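As an aside, this sum of harmonic numbers also has an exact closed form, H_1 + H_2 + ... + H_N = (N + 1) H_N - N, which gives the Θ(N log N) bound directly. A quick check (class name is mine):

```java
// Verify the identity H_1 + H_2 + ... + H_N = (N + 1) * H_N - N,
// which makes the Theta(N log N) growth of the sum explicit.
public class HarmonicSumIdentity {
    public static double sumOfHarmonics(int n) {
        double sum = 0.0, h = 0.0;
        for (int k = 1; k <= n; k++) {
            h += 1.0 / k;   // h is now H_k
            sum += h;
        }
        return sum;
    }

    public static double closedForm(int n) {
        double h = 0.0;
        for (int k = 1; k <= n; k++) {
            h += 1.0 / k;   // h is now H_n after the loop
        }
        return (n + 1) * h - n;
    }

    public static void main(String[] args) {
        // The two computations should agree up to floating-point rounding.
        System.out.println(sumOfHarmonics(1000));
        System.out.println(closedForm(1000));
    }
}
```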

Multiplying this back by the factor of (N + 1) that we pulled out earlier, we get that the overall runtime is Θ(N^2 log N).
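As a numeric sanity check (with my own class name), we can evaluate the exact expected-work sum (N + 1) * Σ (k + 1)/(N - k) from above and watch its ratio to N^2 ln N settle near a constant:

```java
// Exact expected work: (N + 1) * sum over k of (k + 1) / (N - k),
// compared against the predicted N^2 ln N growth.
public class ExpectedWork {
    public static double expectedWork(int n) {
        double sum = 0.0;
        for (int k = 0; k < n; k++) {
            sum += (double) (k + 1) / (n - k);
        }
        return (n + 1) * sum;
    }

    public static void main(String[] args) {
        for (int n = 1000; n <= 8000; n *= 2) {
            // The ratio approaches a constant slightly below 1 as N grows.
            System.out.println(n + ": " + expectedWork(n) / (n * (double) n * Math.log(n)));
        }
    }
}
```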

So, does that hold up in practice? To find out, I ran the code over a range of inputs and counted the average number of iterations of the loop in isThere. I then divided each term by log N and did a polynomial-fit regression to see how closely the remainder matched Θ(N^2). The regression found that the best polynomial plot had a polynomial term of N^2.01, strongly supporting that (after multiplying back in the log N term) we're looking at N^2 log N. (Note that running the same regression but without first dividing out the log N term shows a fit of N^2.15, which clearly indicates something other than N^2 is going on here.)

Using the equation Predicted(N) = (N^2 ln N) / 1.06, with that last constant determined empirically, we get the plot up above, which is almost a perfect match.
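If you'd like to reproduce the measurement, here's one way to instrument the original code; the counter, seeding, and names are my additions:

```java
import java.util.Random;

// firstAlgo with a counter for the number of array slots scanned,
// which is the proxy for total work used in the analysis above.
public class InstrumentedAlgo {
    static long slotsScanned = 0;

    static boolean isThere(int[] array, int num, int index) {
        boolean isItThere = false;
        for (int i = 0; i <= index; i++) {
            slotsScanned++;                 // count each slot comparison
            if (array[i] == num) {
                isItThere = true;
            }
        }
        return isItThere;
    }

    static long measure(int n, long seed) {
        slotsScanned = 0;
        Random random = new Random(seed);   // fixed seed for reproducibility
        int[] arr = new int[n];
        for (int i = 0; i < n; i++) {
            int temp = random.nextInt(n + 1);
            while (isThere(arr, temp, i)) {
                temp = random.nextInt(n + 1);
            }
            arr[i] = temp;
        }
        return slotsScanned;
    }

    public static void main(String[] args) {
        for (int n = 500; n <= 4000; n *= 2) {
            // Ratio of measured work to N^2 ln N; should hover near a constant.
            System.out.println(n + ": " + measure(n, 42L) / (n * (double) n * Math.log(n)));
        }
    }
}
```

A single run is noisy (the last few slots dominate the variance), so averaging over several seeds gives a smoother curve.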


Now, a quick coda to this problem. In retrospect, I should have predicted that the runtime would be Θ(N^2 log N) almost immediately. The reason why is that this problem is closely connected to the coupon collector's problem, a classic probability puzzle. In that puzzle, there are N different coupons to collect, and you keep collecting coupons at random. The question is then - on expectation, how many coupons do you need to collect before you'll have one copy of each of them?

That closely matches the problem we have here, where at each step we're picking from a bag of N+1 options and trying to eventually pick all of them. That would imply we need Θ(N log N) random draws across all iterations. However, the cost of each draw depends on which loop iteration we're in, with later draws costing more than earlier ones. Based on the assumption that most of the draws would be in the back half, I should have been able to guess that we'd do an average of Θ(N) work per draw, then multiplied that to get Θ(N^2 log N). But instead, I just rederived things from scratch. Oops.
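If you want to see the coupon-collector connection concretely, here's a small simulation sketch (all names are mine); the expected number of draws is N * H_N = Θ(N log N):

```java
import java.util.Random;

// Coupon collector: expected number of uniform random draws from n coupons
// until every coupon has been seen at least once is n * H_n = Theta(n log n).
public class CouponCollector {
    public static long drawsToCollectAll(int n, Random random) {
        boolean[] seen = new boolean[n];
        int distinct = 0;
        long draws = 0;
        while (distinct < n) {
            int c = random.nextInt(n);
            draws++;
            if (!seen[c]) {
                seen[c] = true;
                distinct++;
            }
        }
        return draws;
    }

    public static void main(String[] args) {
        Random random = new Random(7L);
        int n = 1000, trials = 200;
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += drawsToCollectAll(n, random);
        }
        // n * H_n for n = 1000 is about 7485 draws.
        System.out.println((double) total / trials);
    }
}
```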

Phew, that was a trip! Hope this helps!

answered Feb 01 '23 by templatetypedef