Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL - Similarity between two strings of varying length

I have a SQL Server table of products, and each product has a description that is publicly available on our website. I want to prevent, or at least warn our users when, a description is too similar to another product's description. Each product's description length can greatly vary.

I'd like query for products with descriptions that include duplicate/similar paragraphs/blocks of text between one another. i.e. String A has a bunch of unique content, but shares a similar/identical paragraph w/ string B. However, I'm not sure which similarity algorithm is best to use:

  • The Levenshtein distance and Jaro-Winler distance algorithms appear to only work well with short strings.

  • I'm not sure the longest common subsequence algorithm takes into account large differences very well. i.e. it appears to ignore potential space between two characters, finding any similar combination sequence.

Fuzzy hashing sort of sounds what I'm looking for, but I'm not just looking for duplicate content w/ subtle differences. I'm also looking for duplicate content w/ subtle differences injected within a unique block of text. And I'd have no idea how to implement fuzzy hashes in SQL. SOUNDEX() and DIFFERENCE() appear to use fuzzy hashing, but are quite imprecise for my use case.

Ideally the similarity SQL function would be fast, but I could store cached similarity values in another table and schedule a job to occasionally update.

What is the best algorithm/SQL (or CLR integration) implementation to accomplish this?

like image 394
David Budiac Avatar asked Sep 04 '13 22:09

David Budiac


People also ask

How do I compare two string lengths in SQL?

We can compare two or more strings using the STRCMP string function, LIKE operator, and Equal operator.

How can we check similarity between two strings in SQL?

Using the Levenshtein distance method This method can be used among others (Soundex, LIKE statement, Regexp) to perform string similarity or string matching in order to identify two elements (text, strings, inputs) that are similar but not identical.

How do you compare the similarity of two strings?

The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.

How does Soundex work in SQL?

SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken in English. The first character of the code is the first character of character_expression, converted to upper case.


1 Answers

I not-so-recently had to join group names by fuzzy string matching.
I have tried about 40 different algorithms, but none was good enough to do this,even though the groupnames writing only differed by some spelling mistakes, missing whitespaces, and occasional added _mLF at the end.

So if you attempt a similar thing, I strongly suggest you stop right now, and send the data (in my case Excel-file) back to the users for correction, where it belongs.

If you're really just interested in comparing strings, this link may be just what you need:
http://anastasiosyal.com/POST/2009/01/11/18.ASPX

I found the Jaro-Winkler function to yield the best results in my case, but you can test that for yourselfs.

like image 199
Stefan Steiger Avatar answered Oct 29 '22 22:10

Stefan Steiger