Find SQL records containing similar strings

I have the following table with 2 columns, ID and Title, containing over 500,000 records. For example:

ID  Title
--  ------------------------
1   Aliens
2   Aliens (1986)
3   Aliens vs Predator
4   Aliens 2
5   The making of "Aliens"

I need to find records that are very similar, and by that I mean they differ by 3-6 letters; usually the difference is at the end of the title. So I have to design a query that returns records 1, 2 and 4. I have already looked at Levenshtein distance, but I don't know how to apply it. Also, because of the number of records, the query shouldn't take all night.

Thanks for any idea or suggestion

asked Mar 14 '11 14:03

Nial

1 Answer

If you really want to define similarity in the exact way that you have formulated in your question, then you would, as you say, have to implement the Levenshtein distance calculation, either in code that runs on each row retrieved by a DataReader or as a SQL Server function.
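To make the "in code" option concrete, here is a minimal Python sketch rather than a definitive implementation. The names `levenshtein` and `similar_titles` and the threshold of 7 are assumptions made for illustration; 7 is used only because "Aliens (1986)" differs from "Aliens" by seven inserted characters, which is slightly more than the 3-6 mentioned in the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insert, delete, substitute) between a and b."""
    # Keep the shorter string in the inner loop so the row buffers stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similar_titles(reference, rows, max_distance):
    """Return the IDs of rows whose title is within max_distance of reference."""
    return [row_id for row_id, title in rows
            if levenshtein(reference, title) <= max_distance]


# Sample rows copied from the question; in practice they would come from the table.
rows = [
    (1, "Aliens"),
    (2, "Aliens (1986)"),
    (3, "Aliens vs Predator"),
    (4, "Aliens 2"),
    (5, 'The making of "Aliens"'),
]

# max_distance=7 is an assumed threshold, not a value from the question.
matches = similar_titles("Aliens", rows, max_distance=7)
print(matches)
```

This compares one reference title against every row. Comparing every title against every other title grows quadratically with the row count, which is where the run time for 500,000 records becomes a concern.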

The problem as stated is trickier than it may appear at first sight, because you cannot assume you know in advance which parts two strings have in common.

So in addition to the Levenshtein distance, you probably also want to specify a minimum number of consecutive characters that have to match before two strings are considered sufficiently similar.
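One way to measure the "consecutive characters that match" part is the length of the longest run of characters two titles share. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the helper name `longest_common_run` is an assumption for illustration, and any minimum run length you pick would need tuning against real data.

```python
from difflib import SequenceMatcher


def longest_common_run(a: str, b: str) -> int:
    """Return the length of the longest block of consecutive characters shared by a and b."""
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long sequences, so every character is considered.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


print(longest_common_run("Aliens", "Aliens 2"))
print(longest_common_run("Aliens", 'The making of "Aliens"'))
```

Both calls find the full six-character run "Aliens", which shows why this check alone is not enough: title 5 shares the whole word but is far away in edit distance. A pair would be accepted only when the shared run is long enough and the edit distance is small enough.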

In sum, it sounds like an overly complicated and slow approach.

Interestingly, in SQL Server 2008 you have the DIFFERENCE function which may be used for something like this.

It compares the phonetic (SOUNDEX) codes of two strings and returns a value from 0 to 4, where 4 means the codes are most similar. I'm unsure whether you will get it to work properly for multi-word expressions such as movie titles, since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting function to be aware of.

If what you are actually trying to describe is some sort of search feature, then you should look into the Full Text Search capabilities of SQL Server 2008. It provides built-in thesaurus support, fancy SQL predicates and a ranking mechanism for "best matches".

EDIT: If you are looking to eliminate duplicates, maybe you could look into the SSIS Fuzzy Lookup and Fuzzy Grouping transformations. I have not tried this myself, but it looks like a promising lead.

EDIT2: If you don't want to dig into SSIS and still struggle with the performance of the Levenshtein distance algorithm, you could perhaps try this algorithm, which appears to be less complex.

answered Sep 18 '22 23:09

12 revs, 2 users 86%
