Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to search SQL database for similar words (mean not identical words)?

Tags:

java

c#

php

mysql

Is there a way to search MySQL database for similar words (mean not identical words) . For example : user searches in database for word "abcd" and there is a word "abd" in the database so the search engine or the program Ask the user "Do you mean [abd] ? " as in most of search engines in the web ? Please notice that the search word is not a part of the existing word (can't use "like")

like image 671
EgyEast Avatar asked Aug 18 '10 23:08

EgyEast


People also ask

How to search for a specific text in a specific database?

Select the Object search command: 1 In the Search text field, enter the text that needs to be searched (e.g. ... 2 From the Database drop-down menu, select the database to search in 3 In the Objects drop-down list, select the object types to search in, or leave them all checked More items...

How to evaluate the similarity of two names in SQL?

Learn SQL for interviews using SQL Course by GeeksforGeeks. SOUNDEX () Function can find the inconsistency in names. SOUNDEX () can evaluate the similarity of two names. SOUNDEX () only works well when we do have 1 or 2 tokens. Names generally contain 1-2 tokens, so it works fine with Name.

What is an example of search in DBMS?

Some of the examples of search that a developer / DBA may make within a database or database server are: searching for a database object, searching for occurance of particular text within database objects, searching within the schema of a database object, search within the results of query results or entire tables, etc..

How to look for similar strings in a SQL Server column?

There are may ways to look for similar strings in a SQL Server column. The most common method is to make use of LIKE operator. Let us see the different ways to look for similar string in a table. Both charindex and patindex look for the match and return its position in the string.


2 Answers

Have a look at the Damerau-Levenshtein distance algorithm. It calculates the "distance" between two strings and determines how many steps it takes to transform one string into another. The less steps the closer the two strings are.

This article shows the algorithm implemented as a MySQL stored function.

The algorithm is so much better than LIKE or SOUNDEX.

I believe Google uses crowd sourced data rather than an algorithm. ie if a user types in abcd, clicks on the back button and then immediately searches for abd then it establishes a relationship between the two search terms as the user wasn't happy with the results. Once you have a very large community searching then the pattern appears.

like image 195
Dave Barker Avatar answered Oct 03 '22 13:10

Dave Barker


Since the link in Dave Barker's answer is dead, here is the code from an archived version of the website:

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END

To note:

  • Maximum length of input strings is 255 characters. I’m sure you could edit the function to support more if needed.

  • I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.

  • I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.

And as a bonus I also created a helper function that returns the ratio (as a percentage) of different : same characters which can be more helpful than just a straight edit distance (idea from here).

CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, max_len INT;
        SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
        IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
        RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
      END
like image 40
nepoh Avatar answered Oct 03 '22 14:10

nepoh