I am looking for a simple way (UDF?) to establish the similarity between strings. The SOUNDEX
and DIFFERENCE
function do not seem to do the job.
Similarity should be based on number of characters in common (order matters).
For example:
Spiruroidea sp. AM-2008
and
Spiruroidea gen. sp. AM-2008
should be recognised as similar.
Any pointers would be very much appreciated.
Thanks.
Christian
You may want to consider implementing the Levenshtein Distance algorithm as a UDF, so that it will return the number of operations that need to be performed on String A in order for it to become String B. This is often referred to as the edit distance.
You can then compare the result of the Levenshtein Distance function against a fixed threshold, or against a percentage length of String A or String B.
You would simply use it as follows:
WHERE LEVENSHTEIN(Field_A, Field_B) < 4;
You may want to check out the following Levenshtein Distance implementation for SQL Server:
These kind of things are not trivial and you should provide more examples.
As mentioned by Daniel levenshtein distance is a way to go, but also for your example you might want to pre-process the strings if you know that you can safely drop certain words - for example it seems from your example that the word gen. can be dropped.
The levenshtein distance will consider any four letter word instead of gen. as the same as gen. which might not be what you want.
Also if your data set will come from different data sources you might consider building a dictionary of synonyms and investigate existing standard taxonomies for your domain. Perhaps such as this?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With