Hi and thanks for looking!
I have an XML file that contains 1900 nodes which themselves contain a string of encoded data of about 3400 characters.
As part of a use case for an application I am developing, I need to be able to take a "benchmark" string at runtime, and find the closest match from the XML file.
Please note that XML is not germane to the app and that I may go with SQL moving forward, but for today, I just needed an easy place to store the data and prove the concept.
I am using .NET 4.0, C#, forms app, LINQ, etc.
How do I find the closest match? Hamming? Levenshtein? There are plenty of code samples online, but most are geared towards small string comparisons ("ant" vs. "aunt") or exact match. I will rarely have exact matches; I just need closest match.
Thanks in advance!
Matt
You mentioned using Levenhstein's Edit Distance and that your strings were about 3400 characters long.
I did a quick try and using the dynamic programming version of Levenhstein's Edit Distance it seems to be quite fast and cause no issue.
I did this:
final StringBuilder sb1 = new StringBuilder();
final StringBuilder sb2 = new StringBuilder();
final Random r = new Random(42);
final int n = 3400;
for (int i = 0; i < n; i++) {
sb1.append( (char) ('a' + r.nextInt(26)) );
sb2.append( (char) ('a' + r.nextInt(26)) );
}
final long t0 = System.currentTimeMillis();
System.out.println("LED: " + getLevenshteinDistance(sb1.toString(), sb2.toString()) );
final long te = System.currentTimeMillis() - t0;
System.out.println("Took: " + te + " ms");
And it's finding the distance in 215 ms on a Core 2 Duo from 2006 or so.
Would that work for you?
(btw I'm not sure I can paste the code for the DP LED implementation I've got here so you probably should search the Internet for one Java implementation)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With