Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL string comparison

I asked a similar question a couple months back. Located here: MySQL Query based on string

The problem I am running into is that this works in only one particular order and in some cases it works too well.

Here is a snippet of the data that this query is filtering out (duplicates are intended, actual data):

- BELLMORE
- ATLANTIC BCH
- ATLANTIC BEACH
- E HILLS
- EAST HILLS
- EAST ROCKAWAY
- FAR ROCKAWAY
- FLORAL PARK
- FLORAL PARK
- HIGHLAND HEIGHTS
- N HIGHLAND HGTS
- NORTH HIGHLAND HEIGHTS

The one query that helped in my last question (MySQL Query based on string) worked well for one instance and failed for the other instance. Here is the query:

select names from tablename group by substring_index(names," ",1)

Which returns:

- BELLMORE
- ATLANTIC BEACH
- EAST HILLS
- FAR ROCKAWAY
- FLORAL PARK
- HIGHLAND HEIGHTS
- N HIGHLAND HGTS
- NORTH HIGHLAND HEIGHTS

The problem with that one is that as you can see it removed a city that it should not have because it was only using the first word to group it by. The one it removed was:

- EAST ROCKAWAY

It was GROUP'ed BY EAST.

As I continue to write this I feel it is nearly impossible because the position of the static city name vs. the variable parts is always changing. Unless you can compare a certain amount of characters. Which is not flawless by far. If someone thinks they may have some insight or have worked with and accomplished such a thing I will appreciate the feedback and guidance. The ending result would be:

- BELLMORE
- ATLANTIC BEACH
- EAST HILLS
- EAST ROCKAWAY
- FAR ROCKAWAY
- FLORAL PARK
- HIGHLAND HEIGHTS
like image 695
csteel Avatar asked Mar 27 '12 22:03

csteel


People also ask

Can I compare two strings in SQL?

In SQL Server, there are many built-in string functions that can be used by developers. We can compare strings by using the IF-ELSE statement.

Can we use == in MySQL?

Learn MySQL from scratch for Data Science and AnalyticsIt can be used to compare two values. If you use double equal sign(==) in MySQL, you will get an error message. Now, compare the above variable value with 10. If both the values are same then the result will be 1 otherwise 0.

How do I compare two string lengths in SQL?

STRCMP String function STRCMP is a function of string that compares the specified two strings and gives 0 if the length of the first string is equal to the length of the second string. If the length of the first string is more than the length of the second string, then the function returns 1 otherwise -1.

Can you use comparison operators with strings?

The comparison operators also work on strings. To see if two strings are equal you simply write a boolean expression using the equality operator. print("Yes, we have bananas!")


1 Answers

My suggestion will be an expensive query, but hopefully you could use this type of operation to perform an occasional "cleaning" of your data so that it won't be required every time you query this data.

You might consider looking in a Levenshtein distance formula... which is a string metric for measuring the amount of difference between two sequences.

In order to avoid needing to calculate the distance for a cartesian product of your table, you could first narrow the set of cities and addresses to be compared with a quicker sanity check... such as they begin with the same letter, and have a similar length.

Initially, You could then start off by only returning records with a very small Levenshtein distance... You could then select one variation of the matches returned to apply to the other records in order to normalize your data.

You could then gradually increase the distance until you start to get too many false positives.

Here's an implementation directly in MySql:

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END;
like image 109
Michael Fredrickson Avatar answered Oct 14 '22 05:10

Michael Fredrickson