
Produce a file that has integers common to two large files containing integers

Tags:

algorithm

Specifically: given two large files with 64-bit integers, produce a file with the integers that are present in both files, and estimate the time complexity of your algorithm.

How would you solve this?

Asked Jun 29 '11 by ARV

3 Answers

I changed my mind; I actually like @Ryan's radix sort idea, except I would adapt it a bit for this specific problem.

Let's assume there are so many numbers that they do not fit in memory, but we have all the disk we want. (Not unreasonable given how the question was phrased.)

Call the input files A and B.

So, create 512 new files; call them A_0 through A_255 and B_0 through B_255. File A_0 gets all of the numbers from file A whose high byte is 0. File A_1 gets all of the numbers from file A whose high byte is 1. File B_37 gets all the numbers from file B whose high byte is 37. And so on.

Now all possible duplicates are in (A_0, B_0), (A_1, B_1), etc., and those pairs can be analyzed independently (and, if necessary, recursively). And all disk accesses are reasonably linear, which should be fairly efficient. (If not, adjust the number of bits you use for the buckets...)
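A minimal sketch of that bucketing pass, assuming the integers are stored as raw binary 64-bit values and that bucket files named A_0 .. A_255 are acceptable (the function name and I/O details are illustrative, not from the answer):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Distribute a file of raw 64-bit integers into 256 bucket files,
// <prefix>_0 .. <prefix>_255, keyed on the high byte of each value.
// Run it once for file A and once for file B to get the (A_i, B_i) pairs.
void bucketByHighByte(const std::string& input, const std::string& prefix) {
    std::vector<std::ofstream> buckets;
    buckets.reserve(256);
    for (int i = 0; i < 256; ++i)
        buckets.emplace_back(prefix + "_" + std::to_string(i), std::ios::binary);

    std::ifstream in(input, std::ios::binary);
    uint64_t x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof(x))) {
        unsigned high = static_cast<unsigned>(x >> 56);   // top 8 bits select the bucket
        buckets[high].write(reinterpret_cast<const char*>(&x), sizeof(x));
    }
}
```

Keeping all 256 output streams open at once assumes the OS file-descriptor limit allows it; if not, do the distribution in a few sub-passes, or use fewer bits per level as the answer suggests.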

This is still O(n log n), but it does not require holding everything in memory at any time. (Here, the constant factor in the radix sort is log(2^64) or thereabouts, so it is not really linear unless you have a lot more than 2^64 numbers. Unlikely even for the largest disks.)

[edit, to elaborate]

The whole point of this approach is that you do not actually have to sort the two lists. That is, with this algorithm, you never enumerate the elements of either list in sorted order.

Once you have the files A_0, B_0, A_1, B_1, ..., A_255, B_255, you simply observe that no numbers in A_0 can be the same as any number in B_1, B_2, ..., B_255. So you start with A_0 and B_0, find the numbers common to those files, append them to the output, then delete A_0 and B_0. Then you do the same for A_1 and B_1, A_2 and B_2, etc.

To find the common numbers between A_0 and B_0, you just recurse... Create file A_0_0 containing all elements of A_0 with second byte equal to zero. Create file A_0_1 containing all elements of A_0 with second byte equal to 1. And so forth. Once all elements of A_0 and B_0 have been bucketed into A_0_0 through A_0_255 and B_0_0 through B_0_255, you can delete A_0 and B_0 themselves because you do not need them anymore.

Then you recurse on A_0_0 and B_0_0 to find common elements, deleting them as soon as they are bucketed... And so on.

When you finally get down to buckets that only have one element (possibly repeated), you can immediately decide whether to append that element to the output file.
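To make the recursion concrete, here is a sketch under the same assumptions (raw binary files, buckets named by appending _0 .. _255 to their parent's name); the empty-bucket pruning and the helper names are my additions, not part of the answer:

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Size of a file in bytes (0 if it is missing or empty).
long long fileSize(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    return f ? static_cast<long long>(f.tellg()) : 0;
}

// Split `path` on byte `depth` (0 = high byte), writing <path>_0 .. <path>_255;
// this is the same pass as before, just one byte further to the right.
void splitByByte(const std::string& path, int depth) {
    std::vector<std::ofstream> buckets;
    for (int i = 0; i < 256; ++i)
        buckets.emplace_back(path + "_" + std::to_string(i), std::ios::binary);
    std::ifstream in(path, std::ios::binary);
    uint64_t x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof(x)))
        buckets[(x >> (56 - 8 * depth)) & 0xFF]
            .write(reinterpret_cast<const char*>(&x), sizeof(x));
}

// Recurse on a matched pair of buckets, appending common values to `out`.
// After 8 levels every value remaining in a bucket is identical, so the base
// case only checks that both sides are non-empty and emits that value once.
void intersectBuckets(const std::string& fileA, const std::string& fileB,
                      int depth, std::ofstream& out) {
    if (fileSize(fileA) == 0 || fileSize(fileB) == 0) return;   // nothing can match
    if (depth == 8) {
        std::ifstream a(fileA, std::ios::binary);
        uint64_t v;
        a.read(reinterpret_cast<char*>(&v), sizeof(v));
        out.write(reinterpret_cast<const char*>(&v), sizeof(v));
        return;
    }
    splitByByte(fileA, depth);
    splitByByte(fileB, depth);
    if (depth > 0) {                  // intermediate buckets are deleted once re-split;
        std::remove(fileA.c_str());   // the original inputs (depth 0) are left alone
        std::remove(fileB.c_str());
    }
    for (int i = 0; i < 256; ++i)
        intersectBuckets(fileA + "_" + std::to_string(i),
                         fileB + "_" + std::to_string(i), depth + 1, out);
}
```

Calling intersectBuckets("A", "B", 0, out) starts from the raw inputs, so its first level of recursion is exactly the high-byte pass described earlier. A real implementation would stop recursing once a bucket pair fits in memory and switch to the in-memory fallback sketched below.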

At no time does this algorithm consume more than 2+epsilon times the original space required to hold the two files, where epsilon is less than half a percent. (Proof left as an exercise for the reader.)

I honestly believe this is the most efficient algorithm among all of these answers if the files are too large to fit in memory. (As a simple optimization, you can fall back to the std::set solution if and when the "buckets" get small enough.)
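A sketch of that fallback for a bucket pair that fits in memory; the answer names std::set, while this uses std::unordered_set, which plays the same role here:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_set>

// In-memory fallback for a small bucket pair: load A's bucket into a hash set,
// then stream B's bucket and emit every value found in the set. Erasing on a
// hit means each common value is written only once.
void intersectSmall(const std::string& fileA, const std::string& fileB,
                    std::ofstream& out) {
    std::unordered_set<uint64_t> seen;
    uint64_t x;

    std::ifstream a(fileA, std::ios::binary);
    while (a.read(reinterpret_cast<char*>(&x), sizeof(x)))
        seen.insert(x);

    std::ifstream b(fileB, std::ios::binary);
    while (b.read(reinterpret_cast<char*>(&x), sizeof(x)))
        if (seen.erase(x))
            out.write(reinterpret_cast<const char*>(&x), sizeof(x));
}
```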

Answered Nov 13 '22 by Nemo

You could use a radix sort, then iterate over the sorted results keeping the matches. Radix sort is O(DN), where D is the number of digits in the numbers. The largest 64-bit number is 20 digits long, so the sort for 64-bit integers with a radix of 10 will run in about 20N, or O(N), and the scan for matches runs in O(N). Thus this would run in O(N) time, where N is the number of integers in both files.
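A sketch of that idea when both inputs fit in memory; it radix-sorts a byte at a time (base 256, 8 passes) rather than the base-10 digits described above, but the O(DN) argument is the same:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on 64-bit keys, one byte (base 256) per pass: 8 passes over
// N elements, i.e. O(DN) with D fixed at 8, hence O(N) for this key width.
void radixSort(std::vector<uint64_t>& v) {
    std::vector<uint64_t> buf(v.size());
    for (int shift = 0; shift < 64; shift += 8) {
        std::size_t count[257] = {0};
        for (uint64_t x : v) ++count[((x >> shift) & 0xFF) + 1];   // histogram
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];    // prefix sums = offsets
        for (uint64_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
        v.swap(buf);
    }
}

// Sort both inputs, then a single linear scan keeps the matches.
std::vector<uint64_t> intersectSorted(std::vector<uint64_t> a,
                                      std::vector<uint64_t> b) {
    radixSort(a);
    radixSort(b);
    std::vector<uint64_t> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }   // duplicates propagate if both repeat
    }
    return out;
}
```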

Answered Nov 12 '22 by Ryan Gross

Assuming the files are too large to fit into memory, use an external least-significant-digit (LSD) radix sort on each of the files, then iterate through both files to find the intersection:


External LSD sort in base N (N=10 or N=100 if the digits are in string format; N=16/32/64 if in binary format):

Create N temporary files (0 - N-1). Iterate through the input file. For each integer, find the rightmost digit in base N, and append that integer to the temporary file corresponding to that digit.

Then create a new set of N temporary files, iterate through the previous set of temporary files, find the 2nd-to-the-rightmost digit in base N (prepending 0s where necessary), and append that integer to the new temporary file corresponding to that digit. (and delete the previous set of temporary files)

Repeat until all the digits have been covered. The last set of temporary files contains the integers in sorted order. (Merge if you like into one file, otherwise treat the temporary files as one list.)
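One distribution pass of that external sort might look like the sketch below, assuming raw binary 64-bit values and base 256 (a larger base than the examples above, chosen so eight passes cover a 64-bit key); between passes, the 256 buckets just written are fed back in order 0..255 as the next pass's input:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Pass p of the external LSD sort: distribute every integer into one of 256
// temporary files according to its p-th byte from the right (p = 0..7).
// Reading the resulting buckets back in order 0..255 is the input for pass
// p + 1; after pass 7 that order is fully sorted.
void lsdDistributePass(const std::vector<std::string>& inputs,
                       const std::string& prefix, int p) {
    std::vector<std::ofstream> buckets;
    for (int i = 0; i < 256; ++i)
        buckets.emplace_back(prefix + "_" + std::to_string(i), std::ios::binary);

    for (const std::string& path : inputs) {       // previous pass's files, in order
        std::ifstream in(path, std::ios::binary);
        uint64_t x;
        while (in.read(reinterpret_cast<char*>(&x), sizeof(x))) {
            unsigned digit = static_cast<unsigned>((x >> (8 * p)) & 0xFF);
            buckets[digit].write(reinterpret_cast<const char*>(&x), sizeof(x));
        }
    }
}
```

Pass 0 is called with the original file as its only input; each later pass is called with the 256 file names produced by the previous pass (and deletes them afterwards, as described above).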


Finding the intersection:

Iterate through the sorted integers in each file to produce a pair of iterators that point to the current integer in each file. For each iterator, if the numbers match, append to an output list, and advance both iterators. Otherwise, for the smaller number, throw it away and advance the iterator for that file. Stop when either iterator ends.

(This outputs duplicates where there are input duplicates. If you want to remove duplicates, then the "advance the iterator" step should advance the iterator until the next larger number appears or the file ends.)
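A sketch of that merge-style scan over two already-sorted binary files; the streaming reads play the role of the answer's iterators, and duplicates propagate as noted above:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Advance-the-smaller scan over two sorted files of raw 64-bit integers,
// writing each matching value to the output file.
void intersectSortedFiles(const std::string& pathA, const std::string& pathB,
                          const std::string& pathOut) {
    std::ifstream a(pathA, std::ios::binary), b(pathB, std::ios::binary);
    std::ofstream out(pathOut, std::ios::binary);

    uint64_t x = 0, y = 0;
    auto readA = [&] { return static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof(x))); };
    auto readB = [&] { return static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof(y))); };

    bool haveA = readA(), haveB = readB();
    while (haveA && haveB) {
        if (x < y)      haveA = readA();           // throw away the smaller value
        else if (y < x) haveB = readB();
        else {                                     // match: emit and advance both
            out.write(reinterpret_cast<const char*>(&x), sizeof(x));
            haveA = readA();
            haveB = readB();
        }
    }
}
```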

Answered Nov 12 '22 by Jason S