The scenario is that i have a bulky database of around 500,000 records having address and city field in which there is no such standard way of inserting the value so multiple users, for example some have inserted their city value as bangalore and another have inserted its city value as begaluru or benglore(misspelled)
Also in case of address field same user with multiple record have inserted its address values but the values are not exaclty same for example Mountville park Thomas gate and Montlee park thonas gte.
I need to fetch all those record those are having same and almost similar values(somehow missplelled) of address and city.
Is there any way to get those records with almost similar but unmatched values?
Thankyou.
It will be an expensive query, but since this will hopefully be a one-time operation, you might consider looking in a Levenshtein distance formula.
In order to avoid needing to calculate the distance for a cartesian product of your table, you could first narrow the set of cities and addresses to be compared with a quicker sanity check... such as they begin with the same letter, and have a similar length.
You could then start off by only returning records with a very small Levenshtein distance, and then gradually increasing the distance until you start to get too many false positives.
Here's an implementation directly in MySql:
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
This function could then be used in a a helper function as follows:
CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN
SET max_len = s1_len;
ELSE
SET max_len = s2_len;
END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END;
You could also optimize the levenshtein
function by passing in your current max distance... if the function passes that distance, exit without calculating the exact distance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With