Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Average time complexity of finding top-k elements

Consider the task of finding the top-k elements in a set of N independent and identically distributed floating point values. By using a priority queue / heap, we can iterate once over all N elements and maintain a top-k set by the following operations:

  • if the element x is "worse" than the heap's head: discard x ⇒ complexity O(1)

  • if the element x is "better" than the heap's head: remove the head and insert x ⇒ complexity O(log k)

The worst case time complexity of this approach is obviously O(N log k), but what about the average time complexity? Due to the iid-assumption, the probability of the O(1) operation increases over time, and we rarely have to perform the costly O(log k), especially for k << N.

Is this average time complexity documented in any citable reference? What's the average time complexity? If you have a citeable reference for your answer please include it.

like image 868
bluenote10 Avatar asked Oct 20 '22 18:10

bluenote10


1 Answers

Consider the i'th largest element, and a particular permutation. It'll inserted into the k-sized heap if it appears before no more than k-1 of the (i - 1) larger elements in the permutation.

The probability of that heap-insertion happening is 1 if i <= k, and k/i if i > k.

From this, you can compute the expectation of the number of heap adjustments, using linearity of expectation. It's sum(i = 1 to k)1 + sum(i = k+1 to n)k/i = k + sum(i = k+1 to n)k/i = k * (1 + H(n) - H(k)), where H(n) is the n'th harmonic number.

This is approximately k log(n) (for k << n), and you can compute your average cost from there.

like image 117
Paul Hankin Avatar answered Nov 11 '22 17:11

Paul Hankin