I have duplicate entries that are highly similar, but not exact. Here are some examples:
- 2016: Obama's America
- 2016: Obama's America (VF)
- Hurt Locker
- The Hurt Locker
What would be a query that I could use to get potentially similar titles
?
Please note that I am not trying to remove EXACT duplicates. I am only trying to select similar values in a single column.
I think this can be solved by measuring the distance between strings with some string metric.
Levenshtein seems to be the most well known metric and I have used some implementation of it in Oracle. It is implemented for MySQL also. You might find some other metric that will work better for you.
Not sure this is the best way or most efficient, and it definitely depends on the meaning of similar. If the meaning is the title contains all of the text in one row but some of the text in another row, then something like this should work:
SELECT DISTINCT T.Title
FROM YourTable T
LEFT JOIN YourTable T2 ON T.Title != T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
UNION
SELECT DISTINCT T2.Title
FROM YourTable T
LEFT JOIN YourTable T2 ON T.Title != T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
ORDER BY Title
And here is the SQL Fiddle.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With