Similarity between strings - SQL Server 2005

Question

I am looking for a simple way (UDF?) to establish the similarity between strings. The SOUNDEX and DIFFERENCE function do not seem to do the job.

Similarity should be based on number of characters in common (order matters).

For example:

Spiruroidea sp. AM-2008

and

Spiruroidea gen. sp. AM-2008

should be recognised as similar.

Any pointers would be very much appreciated.

Thanks.

Christian

Daniel Vassallo · Accepted Answer

You may want to consider implementing the Levenshtein Distance algorithm as a UDF, so that it will return the number of operations that need to be performed on String A in order for it to become String B. This is often referred to as the edit distance.

You can then compare the result of the Levenshtein Distance function against a fixed threshold, or against a percentage length of String A or String B.

You would simply use it as follows:

WHERE LEVENSHTEIN(Field_A, Field_B) < 4;

You may want to check out the following Levenshtein Distance implementation for SQL Server:

Levenshtein Distance Algorithm: TSQL Implementation

Unreason · Answer

These kind of things are not trivial and you should provide more examples.

As mentioned by Daniel levenshtein distance is a way to go, but also for your example you might want to pre-process the strings if you know that you can safely drop certain words - for example it seems from your example that the word gen. can be dropped.

The levenshtein distance will consider any four letter word instead of gen. as the same as gen. which might not be what you want.

Also if your data set will come from different data sources you might consider building a dictionary of synonyms and investigate existing standard taxonomies for your domain. Perhaps such as this?

Similarity between strings - SQL Server 2005

Tags:

sql

sql-server-2005

user-defined-functions

cs0815

2 Answers

Daniel Vassallo

Unreason

Recent Activity

Donate For Us

Similarity between strings - SQL Server 2005

Tags:

sql

sql-server-2005

user-defined-functions

cs0815

2 Answers

Daniel Vassallo

Unreason

Related questions

Recent Activity

Donate For Us