
Good MySQL query to find similar values in a single column

I have duplicate entries that are highly similar, but not exact. Here are some examples:

- 2016: Obama's America
- 2016: Obama's America (VF)

- Hurt Locker
- The Hurt Locker

What query could I use to find potentially similar titles?

Update

Please note that I am not trying to remove EXACT duplicates. I am only trying to select similar values in a single column.

David542 asked Feb 12 '13 20:02



2 Answers

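I think this can be solved by measuring the distance between strings with a string metric: the smaller the distance between two titles, the more similar they are.

Levenshtein (edit) distance is probably the best-known such metric, and I have used an implementation of it in Oracle. Implementations exist for MySQL as well, as stored or user-defined functions rather than as a built-in. You might find another metric that works better for your data.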
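
To make the idea concrete, here is a minimal Python sketch of the Levenshtein distance and of a pairwise comparison over a list of titles. The function names `levenshtein` and `similar_pairs` and the `max_distance` threshold are my own illustrative choices, not part of any MySQL or Oracle API; in MySQL the same logic would have to live in a stored function or in application code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similar_pairs(titles, max_distance):
    """Return (title_a, title_b, distance) for each pair of different titles
    whose case-insensitive edit distance is at most max_distance."""
    pairs = []
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            d = levenshtein(a.lower(), b.lower())
            if 0 < d <= max_distance:
                pairs.append((a, b, d))
    return pairs


titles = [
    "2016: Obama's America",
    "2016: Obama's America (VF)",
    "Hurt Locker",
    "The Hurt Locker",
]
for a, b, d in similar_pairs(titles, max_distance=5):
    print(f"{d}: {a!r} ~ {b!r}")
```

The threshold of 5 is only an assumption that happens to fit the examples in the question. A fixed threshold is too loose for short titles and too strict for long ones, so dividing the distance by the length of the longer title may work better. Comparing every pair is also quadratic in the number of rows, which matters on a large table.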

Bulat answered Oct 03 '22 21:10



I'm not sure this is the best or most efficient way, and it definitely depends on what "similar" means. If it means that one title contains the full text of another title (as "The Hurt Locker" contains "Hurt Locker"), then something like this should work:

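-- Longer titles that contain a different title as a substring
SELECT T.Title
FROM YourTable T
   JOIN YourTable T2 ON T.Title <> T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
UNION  -- UNION already removes duplicate rows, so DISTINCT is not needed
-- Shorter titles that are contained in a different title
SELECT T2.Title
FROM YourTable T
   JOIN YourTable T2 ON T.Title <> T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
ORDER BY Title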

And here is the SQL Fiddle.
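
To check which titles the containment rule in the query above picks up, here is a small Python sketch that mirrors its logic on the examples from the question. The function name `containment_matches` and the extra title "Argo" are illustrative assumptions. Note also that Python's `in` is case-sensitive, whereas `LIKE` in MySQL follows the column's collation and is often case-insensitive.

```python
def containment_matches(titles):
    """Return, sorted, every title that contains a different title
    or is contained in one (the same rule as the SQL query)."""
    matched = set()
    for outer in titles:
        for inner in titles:
            if outer != inner and inner in outer:
                matched.add(outer)  # the longer title
                matched.add(inner)  # the title found inside it
    return sorted(matched)


sample = [
    "2016: Obama's America",
    "2016: Obama's America (VF)",
    "Hurt Locker",
    "The Hurt Locker",
    "Argo",  # unrelated title, should not be reported
]
for title in containment_matches(sample):
    print(title)
```

This rule only catches titles where one is a literal substring of the other. A pair that differs by a typo in the middle, such as "Hurt Locker" and "Hurt Lockr", would not be reported.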

sgeddes answered Oct 03 '22 20:10
