I've got two tables and I'd like to grab all of the schools where there is a similarly spelled school or the same school. For example:
my_table_a:
Olde School
New School
Other, C.S. School
Main School
Too Cool for School
my_table_b:
Old School
New ES
Other School
Main School
Hardknocks School
Is it possible to write a SELECT query that will find the similarly spelled schools in the two tables. Is there a way to use LIKE or wildcards on columns?
Something such as:
SELECT my_table_a.school, my_table_b.school
FROM ` my_table_a` , my_table_b
WHERE my_table_a.directory_school_name_09_10 LIKE my_table_b.school
I tried the above statement with my real tables and I simply got what '=' would have produced.
Basically, I want to grab the first 4 schools in each table's column. (Of course, in the real world, I won't know that the first 4 schools are similar).
Is what I'm trying to do even possible?
For a UDF implementation of the Levenshtein Distance algorithm you may want to check out "codejanitor.com: Levenshtein Distance as a MySQL Stored Function":
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
Now let's build a test case, using the data you provided in your question:
CREATE TABLE table_a (name varchar(20));
CREATE TABLE table_b (name varchar(20));
INSERT INTO table_a VALUES('Olde School');
INSERT INTO table_a VALUES('New School');
INSERT INTO table_a VALUES('Other, C.S. School');
INSERT INTO table_a VALUES('Main School');
INSERT INTO table_a VALUES('Too Cool for School');
INSERT INTO table_b VALUES('Old School');
INSERT INTO table_b VALUES('New ES');
INSERT INTO table_b VALUES('Other School');
INSERT INTO table_b VALUES('Main School');
INSERT INTO table_b VALUES('Hardknocks School');
Then:
SELECT *
FROM table_a a
LEFT JOIN table_b b ON (a.name = b.name);
Obviously returns a match where the school names match exactly:
+---------------------+-------------+
| name | name |
+---------------------+-------------+
| Olde School | NULL |
| New School | NULL |
| Other, C.S. School | NULL |
| Main School | Main School |
| Too Cool for School | NULL |
+---------------------+-------------+
5 rows in set (0.00 sec)
Now we can try to use the LEVENSHTEIN
function to return school names that have an edit distance of 2 characters or less:
SELECT *
FROM table_a a
LEFT JOIN table_b b ON (LEVENSHTEIN(a.name, b.name) <= 2);
+---------------------+-------------+
| name | name |
+---------------------+-------------+
| Olde School | Old School |
| New School | NULL |
| Other, C.S. School | NULL |
| Main School | Main School |
| Too Cool for School | NULL |
+---------------------+-------------+
5 rows in set (0.08 sec)
Now using <= 3
as an edit distance threshold:
SELECT *
FROM table_a a
LEFT JOIN table_b b ON (LEVENSHTEIN(a.name, b.name) <= 3);
We get the following result:
+---------------------+--------------+
| name | name |
+---------------------+--------------+
| Olde School | Old School |
| Olde School | Other School |
| New School | Old School |
| Other, C.S. School | NULL |
| Main School | Main School |
| Too Cool for School | NULL |
+---------------------+--------------+
6 rows in set (0.06 sec)
Note how this time Olde School
also matched Other School
, and New School
matched Old School
as well. These are probably false positive, and shows that defining the threshold is very important to avoid incorrect matches.
One common technique to tackle this problem is to take into consideration the length of the strings when applying a threshold. In fact, the site that I cited for this implementation also provides a LEVENSHTEIN_RATIO
function which returns the ratio (as a percentage) of the edit difference based on the length of the strings.
You can try comparing the result of calling SOUNDEX.
Or you can compare the Levenshtein distance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With