What is an efficient way to compute the Dice coefficient between 900,000 strings?

I have a corpus of 900,000 strings. They vary in length, but average about 4,500 characters. I need the most efficient way to compute the Dice coefficient of every string against every other string. Done naively, that means running the Dice coefficient algorithm some 810,000,000,000 times (900,000 x 900,000).

What is the best way to structure this program for efficiency? Obviously, I can avoid computing the Dice coefficient of A and B and then again of B and A, but that only halves the work required. Should I consider taking some shortcuts or building some sort of binary tree?

I'm using the following implementation of the Dice coefficient algorithm in Java:

import java.util.HashSet;
import java.util.Set;

// Dice coefficient over the distinct character bigrams of s1 and s2.
public static double diceCoefficient(String s1, String s2) {
    Set<String> nx = bigrams(s1);
    Set<String> ny = bigrams(s2);

    // Strings shorter than two characters have no bigrams; avoid 0/0 (NaN).
    if (nx.isEmpty() && ny.isEmpty()) {
        return 0.0;
    }

    // Count the bigrams common to both strings.
    Set<String> intersection = new HashSet<String>(nx);
    intersection.retainAll(ny);
    double totcombigrams = intersection.size();

    return (2 * totcombigrams) / (nx.size() + ny.size());
}

// Collects the distinct two-character substrings of s.
private static Set<String> bigrams(String s) {
    Set<String> result = new HashSet<String>();
    for (int i = 0; i < s.length() - 1; i++) {
        result.add(s.substring(i, i + 2));
    }
    return result;
}

My ultimate goal is to output the ID of every section (string) that has a Dice coefficient greater than 0.9 with at least one other section.

Thanks for any advice that you can provide!

asked Feb 17 '12 21:02

Fred Milton

1 Answer

Make a single pass over all the Strings, and build up a HashMap which maps each bigram to the set of indexes of the Strings which contain that bigram. (Currently you rebuild each String's bigram set redundantly, roughly 900,000 times, once for every comparison it takes part in.)

Then make a pass over all of those index sets, and build a second HashMap from [index,index] pairs to common-bigram counts. (This second Map should not contain redundant key pairs like [1,2] and [2,1]; store just one or the other.)

Both of these steps can easily be parallelized. If you need some sample code, please let me know.
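The two passes described above could be sketched roughly as follows. This is a minimal single-threaded illustration, not a tuned implementation: the class name `DiceIndex`, the method `similarIds`, and the trick of packing an [index,index] pair into one `long` key are all assumptions made for the example. It also keeps each String's distinct-bigram count, which the final Dice calculation needs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DiceIndex {

    // Distinct two-character substrings of s.
    static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    // Indexes of all strings whose Dice coefficient with at least one
    // other string is greater than threshold.
    static Set<Integer> similarIds(List<String> strings, double threshold) {
        int n = strings.size();
        int[] sizes = new int[n]; // distinct-bigram count per string

        // Pass 1: bigram -> indexes of the strings containing it.
        // Indexes are appended in increasing order.
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int id = 0; id < n; id++) {
            Set<String> grams = bigrams(strings.get(id));
            sizes[id] = grams.size();
            for (String g : grams) {
                postings.computeIfAbsent(g, k -> new ArrayList<>()).add(id);
            }
        }

        // Pass 2: for each bigram, every pair (i < j) of strings sharing it
        // gets its common-bigram count incremented. The pair is packed into
        // a single long so that [i,j] and [j,i] cannot both appear.
        Map<Long, Integer> shared = new HashMap<>();
        for (List<Integer> ids : postings.values()) {
            for (int a = 0; a < ids.size(); a++) {
                for (int b = a + 1; b < ids.size(); b++) {
                    int i = ids.get(a);
                    int j = ids.get(b);
                    long key = ((long) i << 32) | j;
                    shared.merge(key, 1, Integer::sum);
                }
            }
        }

        // Turn the shared counts into Dice coefficients and apply the threshold.
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Long, Integer> e : shared.entrySet()) {
            long key = e.getKey();
            int i = (int) (key >>> 32);
            int j = (int) (key & 0xFFFFFFFFL);
            double dice = 2.0 * e.getValue() / (sizes[i] + sizes[j]);
            if (dice > threshold) {
                result.add(i);
                result.add(j);
            }
        }
        return result;
    }
}
```

Note that pass 2 only pays off when most bigrams occur in few Strings. A bigram found in nearly every String produces a posting list close to 900,000 entries long, and enumerating its pairs is as expensive as the naive approach, while the pair map may not fit in memory.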

NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26 x 26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long (about 4,500 characters on average), you will probably find almost the same bigrams in each String, so most pairs will score close to 1. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
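One candidate for such a formula is to treat each String as a bag (multiset) of bigrams and take the overlap as the sum of the smaller of the two counts for each bigram. This is offered only as a sketch of one possible frequency-aware variant, not as the formula the answer had in mind. The class name `BigramBagDice` and the choice to return 0.0 when neither String has any bigrams are assumptions for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class BigramBagDice {

    // Number of occurrences of each two-character substring of s.
    static Map<String, Integer> bigramCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Frequency-aware Dice: 2 * sum(min(count1, count2)) / (total1 + total2).
    static double dice(String s1, String s2) {
        Map<String, Integer> c1 = bigramCounts(s1);
        Map<String, Integer> c2 = bigramCounts(s2);

        long overlap = 0;
        long total = 0;
        for (Map.Entry<String, Integer> e : c1.entrySet()) {
            total += e.getValue();
            // A bigram contributes as many times as it occurs in both strings.
            overlap += Math.min(e.getValue(), c2.getOrDefault(e.getKey(), 0));
        }
        for (int v : c2.values()) {
            total += v;
        }

        // Assumption: two strings with no bigrams at all score 0.0.
        return total == 0 ? 0.0 : 2.0 * overlap / total;
    }
}
```

With this variant, "aaaa" and "aa" no longer count as identical: the set-based version sees the single bigram "aa" in both, while the bag-based version compares three occurrences against one.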

I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.

answered Sep 25 '22 06:09

Alex D


Tags: string, algorithm, multithreading
