Finding k-largest elements of a very large file (while k is very LARGE)

Let's assume we have a very large file containing billions of integers, and we want to find the k largest of these values.

The tricky part is that k itself is very large too, which means we cannot keep k elements in memory (for example, we have a file with 100 billion elements and we want to find the 10 billion largest).

How can we do this in O(n)?

What I thought:

We start reading the file and compare each element against a second file that keeps the k largest elements seen so far (sorted in increasing order). If the element we read is larger than the first line of the second file, we delete that first line and insert the new element into the second file. The time complexity would be O(N log k) if we have random access to that file; otherwise it would be O(Nk).

Any idea how to do this in O(n)? I guess if we had an external version of the selection algorithm (the partitioning algorithm from quicksort) we would be able to do this in O(n), but I couldn't find one anywhere.
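For reference, a minimal in-memory sketch of this O(N log k) baseline, assuming k did fit in memory (which it doesn't at the sizes above) and one integer per line; the function name is just for illustration.

import heapq

def k_largest_in_memory(path, k):
    heap = []  # min-heap holding the k largest values seen so far
    with open(path) as f:
        for line in f:
            x = int(line)
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:
                # kick out the smallest of the current top-k
                heapq.heapreplace(heap, x)
    return heap  # the k largest, in heap order (not sorted)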

Asked Jul 01 '13 by Arian
2 Answers

PS: My definition of K here is different: it is a smallish number, say 2 or 100 or 1000. In what follows, m corresponds to the OP's definition of k. Sorry about this.

It depends on how many reads of the original data you can do and how much extra space you have. This approach assumes you have extra space equivalent to the original data.

Step 1: Pick K random numbers from across the whole data set.
Step 2: Sort these K numbers; they become pivots with indices 1 to K.
Step 3: Create K+1 separate files and name them 0 to K.
Step 4: For every element in the data, if it falls between the ith and (i+1)th pivot, put it in the ith file (below the 1st pivot goes to file 0, above the Kth pivot goes to file K).
Step 5: Based on the size of each file, choose the file that contains the mth number.
Step 6: Repeat everything with that file and the new m (new_m = m - sum_of_sizes_of_all_lower_files).

Regarding the last step: if K=2, m=1000, and the sizes of files 0, 1 and 2 are 800, 900 and 200 respectively, then new_m = m - 800 = 200 and you work through file 1 recursively.
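To make the steps concrete, here is a rough sketch of one pass of this pivot-and-bucket selection, assuming one integer per line in the input file. The helper names, the reservoir sampling used for Step 1, the temporary-file handling and the in-memory cutoff are all assumptions for illustration, not part of the answer. It returns the element of rank m in ascending order; to get the k largest of N elements you would select rank m = N - k + 1 and then make one final pass keeping everything at or above that value (taking care with duplicates).

import bisect, random, tempfile

def sample_pivots(path, K):
    # Step 1: reservoir-sample K values so the whole file never sits in memory.
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            x = int(line)
            if len(sample) < K:
                sample.append(x)
            else:
                j = random.randint(0, i)
                if j < K:
                    sample[j] = x
    return sorted(sample)  # Step 2

def select_rank(path, m, K=100):
    pivots = sample_pivots(path, K)
    # Steps 3-4: scatter every element into one of K+1 bucket files.
    buckets = [tempfile.NamedTemporaryFile("w", delete=False) for _ in range(len(pivots) + 1)]
    sizes = [0] * len(buckets)
    with open(path) as f:
        for line in f:
            x = int(line)
            i = bisect.bisect_right(pivots, x)   # bucket index for x
            buckets[i].write(f"{x}\n")
            sizes[i] += 1
    for b in buckets:
        b.close()
    # Steps 5-6: find the bucket holding rank m and recurse with the new rank.
    seen = 0
    for i, size in enumerate(sizes):
        if m <= seen + size:
            new_m = m - seen
            if size <= 10**6:   # small enough: finish in memory
                values = sorted(int(l) for l in open(buckets[i].name))
                return values[new_m - 1]
            return select_rank(buckets[i].name, new_m, K)
        seen += size
    raise ValueError("rank m out of range")

Note this sketch leaves the temporary files behind and can recurse slowly if the data is dominated by one repeated value; both would need extra handling in a real implementation.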

Answered Oct 25 '22 by ElKamina

You can do this pretty easily with a standard merge type algorithm.

Say you have 100 billion numbers and you want the top 10 billion. We'll say you can hold 1 billion numbers in memory at any time.

So you make a pass:

while not end of input
    read 1 billion numbers
    sort them in descending order
    save position of output file
    write sorted numbers to output file

You then have a file that contains 100 blocks of 1 billion numbers each. Each block is sorted in descending order.
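A sketch of that first pass, assuming one integer per line and a chunk size that fits comfortably in memory; the function name and the (offset, count) bookkeeping per run are assumptions for illustration, not part of the answer.

def write_sorted_runs(in_path, run_path, chunk_size=1_000_000_000):
    runs = []                          # (start offset, item count) per block
    with open(in_path) as src, open(run_path, "w") as dst:
        while True:
            chunk = []
            for line in src:           # resumes where the last chunk stopped
                chunk.append(int(line))
                if len(chunk) == chunk_size:
                    break
            if not chunk:
                break
            chunk.sort(reverse=True)   # descending, largest first
            runs.append((dst.tell(), len(chunk)))
            dst.write("\n".join(map(str, chunk)) + "\n")
    return runs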

Now create a max heap. Add the first number of each block to the heap. You'll also have to add the block number or the number's position in the file so that you can read the next number.

Then:

while num_selected < 10 billion
    selected = heap.remove()
    ++num_selected
    write selected to output
    read next number from the selected block and place on heap

There's a small bit of complexity involved, keeping track of which block the number came from, but it's not too bad.
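Here is one way that bookkeeping might look, as a sketch that merges the runs written by the write_sorted_runs sketch above, without the per-block read buffering mentioned below. Python's heapq is a min-heap, so values are negated to get the max-heap behavior described; everything else (names, one integer per line, offsets treated as plain byte positions in ASCII text) is assumed.

import heapq

def top_k_from_runs(run_path, runs, k, out_path):
    readers, remaining, heap = [], [], []
    for block_id, (offset, count) in enumerate(runs):
        f = open(run_path)
        f.seek(offset)                 # jump to the start of this block
        readers.append(f)
        remaining.append(count - 1)    # one value is consumed right away
        # store (negated value, block id): popping the min gives the true max
        heapq.heappush(heap, (-int(f.readline()), block_id))
    written = 0
    with open(out_path, "w") as out:
        while heap and written < k:
            neg, block_id = heapq.heappop(heap)
            out.write(f"{-neg}\n")
            written += 1
            if remaining[block_id] > 0:          # refill from the same block
                remaining[block_id] -= 1
                heapq.heappush(heap, (-int(readers[block_id].readline()), block_id))
    for f in readers:
        f.close()

Chained together for the sizes in the example, that would be something like: runs = write_sorted_runs("input.txt", "runs.txt") followed by top_k_from_runs("runs.txt", runs, 10_000_000_000, "top10b.txt").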

The max heap never contains more than 100 items (basically, one item per block), so memory isn't an issue in the second pass. With a bit of work, you can avoid a lot of reads by creating a smallish buffer for each block so that you don't incur the cost of a disk read for every number that's selected.

It's basically just a disk merge sort, but with an early out.

Complexity of the first pass is b * (m log m), where b is the number of blocks and m is the number of items in a block. N, the total number of items in the file, is equal to b * m. Complexity of the second pass is k log b, where k is the number of items to select and b is the number of blocks.

Answered Oct 25 '22 by Jim Mischel