I asked a similar question a couple months back. Located here: MySQL Query based on string
The problem I am running into is that this works in only one particular order and in some cases it works too well.
Here is a snippet of the data that this query is filtering out (duplicates are intended, actual data):
- BELLMORE
- ATLANTIC BCH
- ATLANTIC BEACH
- E HILLS
- EAST HILLS
- EAST ROCKAWAY
- FAR ROCKAWAY
- FLORAL PARK
- FLORAL PARK
- HIGHLAND HEIGHTS
- N HIGHLAND HGTS
- NORTH HIGHLAND HEIGHTS
The one query that helped in my last question (MySQL Query based on string) worked well for one instance and failed for the other instance. Here is the query:
select names from tablename group by substring_index(names," ",1)
Which returns:
- BELLMORE
- ATLANTIC BEACH
- EAST HILLS
- FAR ROCKAWAY
- FLORAL PARK
- HIGHLAND HEIGHTS
- N HIGHLAND HGTS
- NORTH HIGHLAND HEIGHTS
The problem with that one is that as you can see it removed a city that it should not have because it was only using the first word to group it by. The one it removed was:
- EAST ROCKAWAY
It was GROUP'ed BY EAST.
As I continue to write this I feel it is nearly impossible because the position of the static city name vs. the variable parts is always changing. Unless you can compare a certain amount of characters. Which is not flawless by far. If someone thinks they may have some insight or have worked with and accomplished such a thing I will appreciate the feedback and guidance. The ending result would be:
- BELLMORE
- ATLANTIC BEACH
- EAST HILLS
- EAST ROCKAWAY
- FAR ROCKAWAY
- FLORAL PARK
- HIGHLAND HEIGHTS
In SQL Server, there are many built-in string functions that can be used by developers. We can compare strings by using the IF-ELSE statement.
Learn MySQL from scratch for Data Science and AnalyticsIt can be used to compare two values. If you use double equal sign(==) in MySQL, you will get an error message. Now, compare the above variable value with 10. If both the values are same then the result will be 1 otherwise 0.
STRCMP String function STRCMP is a function of string that compares the specified two strings and gives 0 if the length of the first string is equal to the length of the second string. If the length of the first string is more than the length of the second string, then the function returns 1 otherwise -1.
The comparison operators also work on strings. To see if two strings are equal you simply write a boolean expression using the equality operator. print("Yes, we have bananas!")
My suggestion will be an expensive query, but hopefully you could use this type of operation to perform an occasional "cleaning" of your data so that it won't be required every time you query this data.
You might consider looking in a Levenshtein distance formula... which is a string metric for measuring the amount of difference between two sequences.
In order to avoid needing to calculate the distance for a cartesian product of your table, you could first narrow the set of cities and addresses to be compared with a quicker sanity check... such as they begin with the same letter, and have a similar length.
Initially, You could then start off by only returning records with a very small Levenshtein distance... You could then select one variation of the matches returned to apply to the other records in order to normalize your data.
You could then gradually increase the distance until you start to get too many false positives.
Here's an implementation directly in MySql:
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With