I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can prevent computing the Dice of sections A and B, and then B and A--but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
Set<String> nx = new HashSet<String>();
Set<String> ny = new HashSet<String>();
for (int i = 0; i < s1.length() - 1; i++) {
char x1 = s1.charAt(i);
char x2 = s1.charAt(i + 1);
String tmp = "" + x1 + x2;
nx.add(tmp);
}
for (int j = 0; j < s2.length() - 1; j++) {
char y1 = s2.charAt(j);
char y2 = s2.charAt(j + 1);
String tmp = "" + y1 + y2;
ny.add(tmp);
}
Set<String> intersection = new HashSet<String>(nx);
intersection.retainAll(ny);
double totcombigrams = intersection.size();
return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!
3. Dice Coefficient (F1 Score) Simply put, the Dice Coefficient is 2 * the Area of Overlap divided by the total number of pixels in both images.
Dice coefficient shouldn't be greater than 1. A dice coefficient usually ranges from 0 to 1. If you are getting a coefficient greater than 1, maybe you need to check your implementation.
similarity = dice( BW1 , BW2 ) computes the Sørensen-Dice similarity coefficient between binary images BW1 and BW2 . similarity = dice( L1 , L2 ) computes the Dice index for each label in label images L1 and L2 .
Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With