Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Similarity between strings - SQL Server 2005

I am looking for a simple way (UDF?) to establish the similarity between strings. The SOUNDEX and DIFFERENCE function do not seem to do the job.

Similarity should be based on number of characters in common (order matters).

For example:

Spiruroidea sp. AM-2008

and

Spiruroidea gen. sp. AM-2008

should be recognised as similar.

Any pointers would be very much appreciated.

Thanks.

Christian

like image 896
cs0815 Avatar asked Apr 12 '10 11:04

cs0815


2 Answers

You may want to consider implementing the Levenshtein Distance algorithm as a UDF, so that it will return the number of operations that need to be performed on String A in order for it to become String B. This is often referred to as the edit distance.

You can then compare the result of the Levenshtein Distance function against a fixed threshold, or against a percentage length of String A or String B.

You would simply use it as follows:

WHERE LEVENSHTEIN(Field_A, Field_B) < 4;

You may want to check out the following Levenshtein Distance implementation for SQL Server:

  • Levenshtein Distance Algorithm: TSQL Implementation
like image 61
Daniel Vassallo Avatar answered Oct 04 '22 22:10

Daniel Vassallo


These kind of things are not trivial and you should provide more examples.

As mentioned by Daniel levenshtein distance is a way to go, but also for your example you might want to pre-process the strings if you know that you can safely drop certain words - for example it seems from your example that the word gen. can be dropped.

The levenshtein distance will consider any four letter word instead of gen. as the same as gen. which might not be what you want.

Also if your data set will come from different data sources you might consider building a dictionary of synonyms and investigate existing standard taxonomies for your domain. Perhaps such as this?

like image 30
Unreason Avatar answered Oct 04 '22 23:10

Unreason