Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to compute similarity between two strings in MYSQL

if i have two strings in mysql:

 @a="Welcome to Stack Overflow" @b=" Hello to stack overflow"; 

is there a way to get the similarity percentage between those two string using MYSQL? here for example 3 words are similar and thus the similarity should be something like:
count(similar words between @a and @b) / (count(@a)+count(@b) - count(intersection))
and thus the result is 3/(4 + 4 - 3)= 0.6
any idea is highly appreciated!

like image 249
Lina Avatar asked Mar 16 '11 08:03

Lina


People also ask

How to check similarity between two strings?

Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings.

How to check If two strings are equal in MySQL?

STRCMP() function in MySQL is used to compare two strings. If both of the strings are same then it returns 0, if the first argument is smaller than the second according to the defined order it returns -1 and it returns 1 when the second one is smaller the first one.


1 Answers

you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):

CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11)     DETERMINISTIC BEGIN      DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;      DECLARE s1_char CHAR;      DECLARE cv0, cv1 text;      SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;      IF s1 = s2 THEN        RETURN 0;      ELSEIF s1_len = 0 THEN        RETURN s2_len;      ELSEIF s2_len = 0 THEN        RETURN s1_len;      ELSE        WHILE j <= s2_len DO          SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;        END WHILE;        WHILE i <= s1_len DO          SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;          WHILE j <= s2_len DO            SET c = c + 1;            IF s1_char = SUBSTRING(s2, j, 1) THEN               SET cost = 0; ELSE SET cost = 1;            END IF;            SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;            IF c > c_temp THEN SET c = c_temp; END IF;              SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;              IF c > c_temp THEN                 SET c = c_temp;               END IF;              SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;          END WHILE;          SET cv1 = cv0, i = i + 1;        END WHILE;      END IF;      RETURN c;    END 

and for getting it as XX% use this function

CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS int(11)     DETERMINISTIC BEGIN      DECLARE s1_len, s2_len, max_len INT;      SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);      IF s1_len > s2_len THEN         SET max_len = s1_len;       ELSE         SET max_len = s2_len;       END IF;      RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);    END 
like image 55
Alaa Avatar answered Oct 06 '22 00:10

Alaa