
Suggestions needed for optimizing O(n^2) algorithm

I am looking to optimize a fairly simple algorithm that is currently O(n^2). I have a file of records, each of which needs to be compared to every other record in the same file. If two records are the 'same' (the comparator function is fairly complicated), the matched records are output. Note that several records may match each other, and there is no sense of order; the comparison only returns true or false.

Pseudo code (written here as runnable Python; compare and write stand in for the real comparator and output routine):


with open("sourceFile") as source:
    for out_rec in source:
        # Get a fresh file handle so the inner loop starts from the top.
        with open("targetFile") as target:
            for in_rec in target:
                if compare(out_rec, in_rec):
                    write(out_rec)
                    write(in_rec)
                # increment some counters
        # increment some other counters

The data is not sorted in any way, and there is no preprocessing possible to order the data.

Any ideas on how this could become something less than O(n^2)? I am thinking of applying the MapReduce paradigm, breaking up the outer and inner loops, possibly using a chained Map function. I am pretty sure I have the code figured out for Hadoop, but wanted to check alternatives before spending time coding it.

Suggestions appreciated!

Added: Record types. Basically, I need to match names/strings. The types of matching are shown in the example below.


1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster

Expected output: records 1, 4, and 5 form a match set.

Added: these files will be quite large. The largest file is expected to be around 200 million records.

asked Jul 12 '11 by banncee


4 Answers

I am not sure about the properties of your comparator and the data set, but assuming that your comparator defines an equivalence relation on your rows, here goes nothing:

  1. Create a map for the input file, using the comparator function as the key comparator of the map. The map values are a sequence/list of rows, i.e. all rows that are the 'same' get successively added to the same map entry. Takes O(n log n) time.
  2. Walk through the other file's rows and check whether each row matches a key in the map. If it does, then by the equivalence relation implied by your comparator you know this row is the 'same' as all the rows in that map entry's value. Takes O(n log n + C), where C depends on how many matches you have to output.

Note that in the worst case, according to your problem description, you cannot get any better than O(n^2), simply because there may be O(n^2) results of matching records that you have to output!
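
A minimal sketch of this two-step approach in Python. It assumes a hypothetical canon() function that maps 'same' rows to an identical key (deriving such a function from a complicated comparator is the hard part); a plain dict stands in for the comparator-keyed map, which makes the inserts O(1) expected rather than O(log n):


from collections import defaultdict

def find_match_sets(source_rows, target_rows, canon):
    # Step 1: index the source file. All rows that are the 'same'
    # accumulate under one key.
    groups = defaultdict(list)
    for row in source_rows:
        groups[canon(row)].append(row)

    # Step 2: probe with the other file's rows. Because the comparator
    # is an equivalence relation, a probe row matches every row stored
    # under its key.
    for row in target_rows:
        key = canon(row)
        if key in groups:
            yield row, groups[key]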

answered Oct 29 '22 by Frank


Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of hash/line number (or file pointer position) combinations. Then sort the list of hashes and identify those that appear more than once.
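
One way that could look in Python, assuming a hypothetical normalize() that reduces 'same' records to identical strings (given the fuzzy name matching in the question, designing normalize() is the hard part, as the next answer points out); a dict grouping stands in for the explicit sort-and-scan step:


import hashlib
from collections import defaultdict

def find_duplicate_lines(path, normalize):
    seen = defaultdict(list)   # hash -> line numbers where it occurs
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            digest = hashlib.sha1(normalize(line).encode()).hexdigest()
            seen[digest].append(lineno)
    # Hashes appearing more than once identify candidate match sets.
    return [lines for lines in seen.values() if len(lines) > 1]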

answered Oct 29 '22 by Joe


We'd need to know more about your comparison function. Is your comparison transitive? (That is, do A==B and B==C imply A==C?) Is it symmetric? (Does A==B imply B==A?)

If your comparison function is transitive and symmetric, and many records being equal is common, then you could bin your records into groups by comparing each one to a single "representative sample" of each group. That could approach O(N) in the best case.

Note that hashing the records requires that compare(A, B) == true implies hash(A) == hash(B); but if compare(A, B) can be true even when bytes(A) != bytes(B), it might be tricky to design an appropriate hashing algorithm.
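
A sketch of the binning idea, assuming compare() is transitive and symmetric as required. The worst case is O(n * number-of-groups), but it approaches O(N) when most records collapse into a few large groups:


def bin_records(records, compare):
    groups = []   # each group is a list of mutually matching records
    for rec in records:
        for group in groups:
            # Compare only against the group's representative: by
            # transitivity, matching it means matching the whole group.
            if compare(rec, group[0]):
                group.append(rec)
                break
        else:
            # No group matched, so this record starts a new one.
            groups.append([rec])
    return groups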

answered Oct 29 '22 by Dennis


FYI, MapReduce will not improve the algorithmic complexity of the solution. It adds some overhead, but then parallelizes the work so that you can use the necessary resources in less wall-clock time.

To improve your wall-clock time, the #1 thing to do is to find ways to avoid running the comparison at all. Any way of doing that will be a win. And even if your comparison logic is complex, you can still use sorting to help.

For instance, suppose that you have some dimension that the data is spread out in. Data that varies too much in that dimension is guaranteed not to compare equal, though being close in that dimension does not guarantee equality. Then what you can do is sort your data by that dimension and only run comparisons between elements that are close in it. Voila! Most of the O(n^2) comparisons are now gone.
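
A sketch of that sort-and-window idea, assuming a hypothetical key(rec) giving a record's numeric position along the chosen dimension and a max_delta beyond which two records are guaranteed not to match:


def windowed_matches(records, key, compare, max_delta):
    records = sorted(records, key=key)
    matches = []
    for i, rec in enumerate(records):
        # Only compare against later records close enough in the dimension.
        j = i + 1
        while j < len(records) and key(records[j]) - key(rec) <= max_delta:
            if compare(rec, records[j]):
                matches.append((rec, records[j]))
            j += 1
    return matches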

Let's make it more complex. Suppose you can identify two such dimensions that are independent of each other. Sort your data along the first dimension and divide it into strips. (Make the strips overlap by the maximum that two records can differ in that dimension and still compare equal.) Now take each strip and sort it by the second dimension. Then run comparisons between pairs of elements that are acceptably close in that dimension, and include a pair in your answer if it compares equal and this is the first strip it could appear in. (That dedup logic is needed because the overlap means a pair that compares equal can appear in multiple strips.) This is likely to be even better than the first approach, because you've narrowed things down so that each row is compared only against a small number of 'nearby' rows.
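
A sketch of the strip refinement, reusing windowed_matches from the previous sketch. Here key1/key2 read the two dimensions, delta1/delta2 bound how far apart matching records can be in each, and width is the strip size; all of these are hypothetical parameters of the data, and the per-strip scan is kept naive for clarity:


def strip_matches(records, key1, key2, compare, delta1, delta2, width):
    matches = []
    lo = min(key1(r) for r in records)
    hi = max(key1(r) for r in records)
    start = lo
    while start <= hi:
        # Extend each strip by delta1 so pairs straddling a boundary
        # still land together in some strip.
        strip = [r for r in records
                 if start <= key1(r) < start + width + delta1]
        # Within the strip, fall back to the one-dimensional window
        # over the second dimension.
        for a, b in windowed_matches(strip, key2, compare, delta2):
            # Dedup: count the pair only in the first strip where both
            # members can appear, i.e. where the smaller key1 falls in
            # the strip's non-overlapping core.
            if min(key1(a), key1(b)) < start + width:
                matches.append((a, b))
        start += width
    return matches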

If you want to use fewer resources, you need to focus on ways to avoid actually making individual comparisons. Anything you come up with along that path will help.

answered Oct 29 '22 by btilly