I am looking to optimize a fairly simple algorithm that is currently O(n²). I have a file of records, where each one needs to be compared to every other record in the same file. If the two are the 'same' (the comparator function is pretty complicated), the matched records are output. Note that there may be several records that match each other, and there is no sense of order - only whether the match is true or false.
Pseudo code:
For (outRec in sourceFile) {
    Get new filePointer for targetFile  // start from the top of the file for the inner loop
    For (inRec in targetFile) {
        if (compare(outRec, inRec) == TRUE) {
            write outRec
            write inRec
        }
        increment some counters
    }
    increment some other counters
}
The data is not sorted in any way, and there is no preprocessing possible to order the data.
Any ideas on how this could become something less than O(n²)? I am thinking of applying the MapReduce paradigm to the code, breaking up the outer AND inner loops, possibly using a chained Map function. I am pretty sure I have the code figured out on Hadoop, but wanted to check alternatives before I spent time coding it.
Suggestions appreciated!
Added: Record types. Basically, I need to match names/strings. The types of matching are shown in the example below.
1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster
Expected output:
Records 1,4,5 form a match set
End of output
Added: These files will be quite large. The largest file is expected to be around 200 million records.
I am not sure about the properties of your comparator and the data set, but assuming that your comparator defines an equivalence relation on your rows, here goes nothing:
Note that in the worst case, according to your problem description, you cannot get any better than O(n^2), simply because there may be O(n^2) results of matching records that you have to output!
Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of hash/line # (or file pointer position) combinations. Then sort the list of hashes and identify those that appear more than once.
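A minimal sketch of that idea in Python, assuming a hypothetical record_key() canonicalization such that any two rows your comparator would accept produce the same key (for a fuzzy name comparator that key is the hard part):

import hashlib
from collections import defaultdict

def record_key(record):
    # Hypothetical canonicalization: any two rows that compare() as equal
    # must produce the same key here (e.g. a trimmed, lower-cased, or
    # phonetically encoded form of the names).
    return record.strip().lower()

def candidate_groups(path):
    # One pass over the file: bucket line numbers by the hash of each row's key.
    buckets = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            digest = hashlib.sha1(record_key(line).encode("utf-8")).hexdigest()
            buckets[digest].append(line_no)
    # Only hashes that occur more than once can contain matches.
    return [lines for lines in buckets.values() if len(lines) > 1]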
We'd need to know more about your comparison function. Is your comparison transitive? (That is, does A==B and B==C imply A==C?) Is it symmetric? (Does A==B imply B==A?)
If your comparison function is transitive and symmetric, and many records being equal is common, then you could bin your records into groups by comparing them to one "representative sample" of each group. That could approach O(N) in the best case.
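A rough sketch of that binning approach, where compare stands in for your existing comparator; it only approaches O(N) when the number of distinct groups stays small:

def bin_records(records, compare):
    # Each group is a list of records; group[0] serves as the representative.
    # Transitivity and symmetry of compare() are what make a single
    # comparison against the representative sufficient.
    groups = []
    for rec in records:
        for group in groups:
            if compare(rec, group[0]):
                group.append(rec)
                break
        else:
            groups.append([rec])
    # Groups with more than one member are your match sets.
    return [g for g in groups if len(g) > 1]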
Note that hashing the records only helps if compare(A, B) == true implies hash(A) == hash(B); if compare(A, B) can be true even when bytes(A) != bytes(B), it might be tricky to design an appropriate hashing algorithm.
FYI MapReduce will not improve the algorithmic complexity of the solution. It adds some overhead, but then parallelizes the work so that you can apply more resources and finish in less wall-clock time.
To improve your wall-clock time the #1 thing to do is to find ways to avoid having to run the comparison. Any way of doing that will be a win. And even if your comparison logic is complex, you can still use sorting to help.
For instance, suppose that you have some dimension that the data is spread out in. Data that varies by too much in that dimension is guaranteed not to compare equal, though being close in that dimension does not guarantee equality. Then what you can do is sort your data by that dimension, and then only run comparisons between elements that are close in that dimension. Voila! Most of the O(n*n) comparisons are now gone.
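As a sketch of that, assuming a numeric key() for such a dimension (record length, say) and a max_gap beyond which two records can never compare equal; both names are illustrative, not part of your code:

def matches_by_sort_key(records, compare, key, max_gap):
    # Sort by the chosen dimension, then only compare records whose key
    # values lie within max_gap of each other; anything further apart is
    # assumed (by the data, not by this code) never to compare equal.
    ordered = sorted(records, key=key)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if key(b) - key(a) > max_gap:
                break  # sorted order: every later b is even further away
            if compare(a, b):
                pairs.append((a, b))
    return pairs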
Let's make it more complex. Suppose you can identify two such dimensions that are independent from each other. Sort your data along the first such dimension. Divide the data into strips along that first dimension. (Make the strips overlap by the maximum they can vary in that dimension and still compare equal.) Now take each strip and sort it by the second dimension. Then run comparisons between pairs of elements that are acceptably close in that dimension, and include the pair in your answer if it compares equal and this is the first strip it could appear in. (That dedup logic is needed because overlap may mean that a pair that compares equal can appear in multiple strips.) This is likely to be even better than the first approach because you've managed to narrow things down so that you're only comparing rows with a small number of "nearby" rows.
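A sketch of that two-dimension variant, with illustrative key1/key2/gap1/gap2/strip_width parameters; for simplicity it deduplicates with a set of seen index pairs rather than the "first strip it could appear in" rule described above:

import bisect

def matches_by_two_dims(records, compare, key1, key2, gap1, gap2, strip_width):
    # Sort by the first dimension; strips of strip_width overlap by gap1,
    # so any pair close enough in key1 to possibly match lands together
    # in at least one strip.
    ordered = sorted(enumerate(records), key=lambda ir: key1(ir[1]))
    if not ordered:
        return []
    keys = [key1(r) for _, r in ordered]
    pairs, seen = [], set()
    start = keys[0]
    while start <= keys[-1]:
        left = bisect.bisect_left(keys, start)
        right = bisect.bisect_left(keys, start + strip_width + gap1)
        # Within a strip, sort by the second dimension and only compare
        # records within gap2 of each other.
        strip = sorted(ordered[left:right], key=lambda ir: key2(ir[1]))
        for i, (ia, a) in enumerate(strip):
            for ib, b in strip[i + 1:]:
                if key2(b) - key2(a) > gap2:
                    break  # everything later in the strip is further away still
                pair_id = (min(ia, ib), max(ia, ib))
                if pair_id not in seen and compare(a, b):
                    seen.add(pair_id)
                    pairs.append((a, b))
        start += strip_width
    return pairs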
If you want to use less resources, you need to focus on ways to avoid having to actually make individual comparisons. Anything that you come up with on that path will help.