Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Querying the record of database for almost similar matching String value

Tags:

java

mysql

The scenario is that i have a bulky database of around 500,000 records having address and city field in which there is no such standard way of inserting the value so multiple users, for example some have inserted their city value as bangalore and another have inserted its city value as begaluru or benglore(misspelled)

Also in case of address field same user with multiple record have inserted its address values but the values are not exaclty same for example Mountville park Thomas gate and Montlee park thonas gte.

I need to fetch all those record those are having same and almost similar values(somehow missplelled) of address and city.

Is there any way to get those records with almost similar but unmatched values?

Thankyou.

like image 720
Nishit Jain Avatar asked Nov 04 '22 05:11

Nishit Jain


1 Answers

It will be an expensive query, but since this will hopefully be a one-time operation, you might consider looking in a Levenshtein distance formula.

In order to avoid needing to calculate the distance for a cartesian product of your table, you could first narrow the set of cities and addresses to be compared with a quicker sanity check... such as they begin with the same letter, and have a similar length.

You could then start off by only returning records with a very small Levenshtein distance, and then gradually increasing the distance until you start to get too many false positives.

Here's an implementation directly in MySql:

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END;

This function could then be used in a a helper function as follows:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END; 

You could also optimize the levenshtein function by passing in your current max distance... if the function passes that distance, exit without calculating the exact distance.

like image 134
Michael Fredrickson Avatar answered Nov 07 '22 23:11

Michael Fredrickson