I am looking to optimize a fairly simple algorithm that is currently O(n²). I have a file of records, where each one needs to be compared to every other record in the same file. If the two are the 'same' (the comparator function is pretty complicated), the matched records are output. Note that there may be several records that match each other, and there is no sense of order - only whether the match is true or false.
Pseudo code:
For (outRec in sourceFile) {
    Get new filePointer for targetFile  // start from the top of the file for the inner loop
    For (inRec in targetFile) {
        if (compare(outRec, inRec) == TRUE) {
            write outRec
            write inRec
        }
        increment some counters
    }
    increment some other counters
}
The data is not sorted in any way, and there is no preprocessing possible to order the data.
Any ideas on how this could become something less than O(n²)? I am thinking of applying the MapReduce paradigm to the code, breaking up the outer AND inner loops, possibly using a chained Map function. I am pretty sure I have the code figured out on Hadoop, but wanted to check alternatives before I spent time coding it.
Suggestions appreciated!
Added: Record types. Basically, I need to match names/strings. The types of matching are shown in the example below.
1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster
Expected output:
Records 1,4,5 form a match set
End of output
Added: These files will be quite large. The largest file is expected to be around 200 million records.
I am not sure about the properties of your comparator and the data set, but assuming that your comparator defines an equivalence relation on your rows, here goes nothing:
Note that in the worst case, according to your problem description, you cannot get any better than O(n^2), simply because there may be O(n^2) results of matching records that you have to output!
Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of hash/line # (or file pointer position) combinations. Then sort the list of hashes and identify those that appear more than once.
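A minimal sketch of that idea in Python, assuming a hypothetical record_key() canonicalization such that any two rows your comparator would accept produce the same key (for a fuzzy name comparator that key is the hard part):

import hashlib
from collections import defaultdict

def record_key(record):
    # Hypothetical canonicalization: any two rows that compare() as equal
    # must produce the same key here (e.g. a trimmed, lower-cased, or
    # phonetically encoded form of the names).
    return record.strip().lower()

def candidate_groups(path):
    # One pass over the file: bucket line numbers by the hash of each row's key.
    buckets = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            digest = hashlib.sha1(record_key(line).encode("utf-8")).hexdigest()
            buckets[digest].append(line_no)
    # Only hashes that occur more than once can contain matches.
    return [lines for lines in buckets.values() if len(lines) > 1]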
We'd need to know more about your comparison function. Is your comparison transitive? (That is, does A==B and B==C imply A==C?) Is it symmetric? (Does A==B imply B==A?)
If your comparison function is transitive and symmetric, and many records being equal is common, then you could bin your records into groups by comparing them to one "representative sample" of each group. That could approach O(N) in the best case.
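A rough sketch of that binning approach, where compare stands in for your existing comparator; it only approaches O(N) when the number of distinct groups stays small:

def bin_records(records, compare):
    # Each group is a list of records; group[0] serves as the representative.
    # Transitivity and symmetry of compare() are what make a single
    # comparison against the representative sufficient.
    groups = []
    for rec in records:
        for group in groups:
            if compare(rec, group[0]):
                group.append(rec)
                break
        else:
            groups.append([rec])
    # Groups with more than one member are your match sets.
    return [g for g in groups if len(g) > 1]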
Note that hashing the records only helps if compare(A, B) == true implies hash(A) == hash(B); if compare(A, B) can be true even when bytes(A) != bytes(B), it might be tricky to design an appropriate hashing algorithm.
FYI MapReduce will not improve the algorithmic complexity of the solution. It adds some overhead, but then parallelizes the work so that you can apply more resources and finish in less wall-clock time.
To improve your wall-clock time the #1 thing to do is to find ways to avoid having to run the comparison. Any way of doing that will be a win. And even if your comparison logic is complex, you can still use sorting to help.
For instance, suppose that you have some dimension that the data is spread out in. Data that varies by too much in that dimension is guaranteed not to compare equal, though being close in that dimension does not guarantee equality. Then what you can do is sort your data by that dimension, and then only run comparisons between elements that are close in that dimension. Voila! Most of the O(n*n) comparisons are now gone.
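As a sketch of that, assuming a numeric key() for such a dimension (record length, say) and a max_gap beyond which two records can never compare equal; both names are illustrative, not part of your code:

def matches_by_sort_key(records, compare, key, max_gap):
    # Sort by the chosen dimension, then only compare records whose key
    # values lie within max_gap of each other; anything further apart is
    # assumed (by the data, not by this code) never to compare equal.
    ordered = sorted(records, key=key)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if key(b) - key(a) > max_gap:
                break  # sorted order: every later b is even further away
            if compare(a, b):
                pairs.append((a, b))
    return pairs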
Let's make it more complex. Suppose you can identify two such dimensions that are independent from each other. Sort your data along the first such dimension. Divide the data into strips along that first dimension. (Make the strips overlap by the maximum they can vary in that dimension and still compare equal.) Now take each strip and sort it by the second dimension. Then run comparisons between pairs of elements that are acceptably close in that dimension, and include the pair in your answer if it compares equal and this is the first strip it could appear in. (That dedup logic is needed because overlap may mean that a pair that compares equal can appear in multiple strips.) This is likely to be even better than the first approach because you've managed to narrow things down so that you're only comparing rows with a small number of "nearby" rows.
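A sketch of that two-dimension variant, with illustrative key1/key2/gap1/gap2/strip_width parameters; for simplicity it deduplicates with a set of seen index pairs rather than the "first strip it could appear in" rule described above:

import bisect

def matches_by_two_dims(records, compare, key1, key2, gap1, gap2, strip_width):
    # Sort by the first dimension; strips of strip_width overlap by gap1,
    # so any pair close enough in key1 to possibly match lands together
    # in at least one strip.
    ordered = sorted(enumerate(records), key=lambda ir: key1(ir[1]))
    if not ordered:
        return []
    keys = [key1(r) for _, r in ordered]
    pairs, seen = [], set()
    start = keys[0]
    while start <= keys[-1]:
        left = bisect.bisect_left(keys, start)
        right = bisect.bisect_left(keys, start + strip_width + gap1)
        # Within a strip, sort by the second dimension and only compare
        # records within gap2 of each other.
        strip = sorted(ordered[left:right], key=lambda ir: key2(ir[1]))
        for i, (ia, a) in enumerate(strip):
            for ib, b in strip[i + 1:]:
                if key2(b) - key2(a) > gap2:
                    break  # everything later in the strip is further away still
                pair_id = (min(ia, ib), max(ia, ib))
                if pair_id not in seen and compare(a, b):
                    seen.add(pair_id)
                    pairs.append((a, b))
        start += strip_width
    return pairs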
If you want to use less resources, you need to focus on ways to avoid having to actually make individual comparisons. Anything that you come up with on that path will help.