Good mysql query to find similar values in a single column

Question

I have duplicate entries that are highly similar, but not exact. Here are some examples:

- 2016: Obama's America
- 2016: Obama's America (VF)

- Hurt Locker
- The Hurt Locker

What would be a query that I could use to get potentially similar titles ?

Update

Please note that I am not trying to remove EXACT duplicates. I am only trying to select similar values in a single column.

Bulat · Accepted Answer

I think this can be solved by measuring the distance between strings with some string metric.

Levenshtein seems to be the most well known metric and I have used some implementation of it in Oracle. It is implemented for MySQL also. You might find some other metric that will work better for you.

sgeddes · Answer

Not sure this is the best way or most efficient, and it definitely depends on the meaning of similar. If the meaning is the title contains all of the text in one row but some of the text in another row, then something like this should work:

SELECT DISTINCT T.Title
FROM YourTable T
   LEFT JOIN YourTable T2 ON T.Title != T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
UNION 
SELECT DISTINCT T2.Title
FROM YourTable T
   LEFT JOIN YourTable T2 ON T.Title != T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
ORDER BY Title

And here is the SQL Fiddle.

Good mysql query to find similar values in a single column

Tags:

sql

mysql

similarity

Update

David542

2 Answers

Bulat

sgeddes

Recent Activity

Donate For Us

Good mysql query to find similar values in a single column

Tags:

sql

mysql

similarity

Update

David542

2 Answers

Bulat

sgeddes

Related questions

Recent Activity

Donate For Us